<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: varun varde</title>
    <description>The latest articles on DEV Community by varun varde (@varunvarde).</description>
    <link>https://dev.to/varunvarde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3761696%2Fecce1536-897c-45d3-ba4f-11e7d0b344ed.jpg</url>
      <title>DEV Community: varun varde</title>
      <link>https://dev.to/varunvarde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/varunvarde"/>
    <language>en</language>
    <item>
      <title>Team Topologies for DevOps: A Practical Implementation Guide</title>
      <dc:creator>varun varde</dc:creator>
      <pubDate>Thu, 21 May 2026 11:35:55 +0000</pubDate>
      <link>https://dev.to/varunvarde/team-topologies-for-devops-a-practical-implementation-guide-16on</link>
      <guid>https://dev.to/varunvarde/team-topologies-for-devops-a-practical-implementation-guide-16on</guid>
      <description>&lt;p&gt;Most engineering organisations do not fail because their developers are untalented.&lt;/p&gt;

&lt;p&gt;They fail because their communication structures, ownership boundaries, and operational dependencies create friction that compounds over time.&lt;/p&gt;

&lt;p&gt;A deployment takes three weeks because four teams must approve it. A platform team becomes a ticket queue instead of a product team. Stream-aligned teams spend more time negotiating dependencies than shipping software. Cognitive overload silently accumulates until incident frequency rises and delivery velocity collapses.&lt;/p&gt;

&lt;p&gt;These are not tooling problems.&lt;/p&gt;

&lt;p&gt;They are topology problems.&lt;/p&gt;

&lt;p&gt;The framework introduced in the book Team Topologies by Matthew Skelton and Manuel Pais provides one of the clearest operational models for designing engineering organisations around flow rather than hierarchy.&lt;/p&gt;

&lt;p&gt;The core idea is deceptively simple&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Optimise team structures for fast, sustainable software delivery.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This article explains how to apply Team Topologies in practice, identify the organisational anti-patterns slowing your DevOps initiatives, and implement structural changes that improve delivery speed without creating organisational chaos.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Team Structure Matters in DevOps
&lt;/h2&gt;

&lt;p&gt;DevOps is often described as a tooling movement.&lt;/p&gt;

&lt;p&gt;It is not.&lt;/p&gt;

&lt;p&gt;It is fundamentally a sociotechnical systems discipline.&lt;/p&gt;

&lt;p&gt;Tooling matters. Automation matters. CI/CD matters.&lt;/p&gt;

&lt;p&gt;But organisational communication paths ultimately determine delivery speed.&lt;/p&gt;

&lt;p&gt;Conway’s Law famously states:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Organisations design systems that mirror their communication structures.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fragmented teams create fragmented systems&lt;/li&gt;
&lt;li&gt;Bottlenecked organisations create bottlenecked architectures&lt;/li&gt;
&lt;li&gt;High-friction communication creates high-friction delivery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Team Topologies provides a practical framework for reducing those organisational bottlenecks systematically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4 Team Types
&lt;/h2&gt;

&lt;p&gt;The Team Topologies model defines four fundamental team types.&lt;/p&gt;

&lt;p&gt;Each exists to solve a distinct operational problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Stream-Aligned Teams&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These are the primary delivery teams.&lt;/p&gt;

&lt;p&gt;A stream-aligned team owns a flow of business value end-to-end.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Payments platform&lt;/li&gt;
&lt;li&gt;Customer onboarding&lt;/li&gt;
&lt;li&gt;Mobile checkout&lt;/li&gt;
&lt;li&gt;Recommendation engine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key principle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Single team → owns service lifecycle completely
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Development&lt;/li&gt;
&lt;li&gt;Deployment&lt;/li&gt;
&lt;li&gt;Operations&lt;/li&gt;
&lt;li&gt;Monitoring&lt;/li&gt;
&lt;li&gt;Incident response&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Characteristics of Strong Stream-Aligned Teams
&lt;/h2&gt;

&lt;p&gt;Healthy stream-aligned teams typically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy independently&lt;/li&gt;
&lt;li&gt;Own production support&lt;/li&gt;
&lt;li&gt;Minimise external dependencies&lt;/li&gt;
&lt;li&gt;Have clear business alignment&lt;/li&gt;
&lt;li&gt;Operate autonomously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example structure&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Payments&lt;/span&gt;
&lt;span class="na"&gt;Ownership&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Payment API&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Fraud checks&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Transaction database&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Deployment pipelines&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Monitoring dashboards&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dramatically reduces coordination overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warning Signs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stream-aligned teams fail when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Too many systems are owned&lt;/li&gt;
&lt;li&gt;Multiple domains are mixed together&lt;/li&gt;
&lt;li&gt;External dependencies dominate delivery&lt;/li&gt;
&lt;li&gt;Teams lack operational authority&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is cognitive overload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Enabling Teams&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enabling teams exist to help other teams improve capabilities.&lt;/p&gt;

&lt;p&gt;Not to permanently do the work for them.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes adoption team&lt;/li&gt;
&lt;li&gt;SRE coaching team&lt;/li&gt;
&lt;li&gt;Security enablement team&lt;/li&gt;
&lt;li&gt;Observability specialists&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Their role is temporary acceleration.&lt;/p&gt;

&lt;p&gt;Not long-term ownership.&lt;/p&gt;

&lt;h2&gt;
  
  
  Healthy Enabling Team Behaviour
&lt;/h2&gt;

&lt;p&gt;Good enabling teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teach&lt;/li&gt;
&lt;li&gt;Coach&lt;/li&gt;
&lt;li&gt;Pair&lt;/li&gt;
&lt;li&gt;Document&lt;/li&gt;
&lt;li&gt;Reduce friction&lt;/li&gt;
&lt;li&gt;Transfer knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bad enabling teams become outsourced implementation departments.&lt;/p&gt;

&lt;p&gt;That destroys scalability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example: Kubernetes Enablement
&lt;/h2&gt;

&lt;p&gt;Good pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Enabling Team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Creates templates&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Runs workshops&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Helps first deployments&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Coaches incident response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad pattern&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Every Kubernetes deployment requires enabling team intervention forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That becomes another bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Complicated Subsystem Teams
&lt;/h2&gt;

&lt;p&gt;Some domains require deep specialist expertise.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML inference systems&lt;/li&gt;
&lt;li&gt;Real-time video encoding&lt;/li&gt;
&lt;li&gt;Cryptography engines&lt;/li&gt;
&lt;li&gt;High-frequency trading systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are cognitively dense domains unsuitable for broad ownership.&lt;/p&gt;

&lt;p&gt;Dedicated specialist teams reduce complexity exposure for the rest of the organisation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Team Type Exists
&lt;/h2&gt;

&lt;p&gt;Without complicated subsystem teams&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Every stream-aligned team
↓
Must understand advanced specialist systems
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This overwhelms cognitive capacity rapidly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A recommendation-engine ML platform might require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tensor optimisation&lt;/li&gt;
&lt;li&gt;GPU scheduling&lt;/li&gt;
&lt;li&gt;Feature stores&lt;/li&gt;
&lt;li&gt;Embedding pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That expertise does not belong inside every product team.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Platform Teams
&lt;/h2&gt;

&lt;p&gt;Platform teams build internal developer platforms.&lt;/p&gt;

&lt;p&gt;Their mission&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reduce cognitive load for stream-aligned teams.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Platform teams should operate like product teams.&lt;/p&gt;

&lt;p&gt;Not internal ticket queues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Platform Team Responsibilities
&lt;/h2&gt;

&lt;p&gt;Typical responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CI/CD systems&lt;/li&gt;
&lt;li&gt;Kubernetes platforms&lt;/li&gt;
&lt;li&gt;Observability tooling&lt;/li&gt;
&lt;li&gt;Secrets management&lt;/li&gt;
&lt;li&gt;Golden deployment paths&lt;/li&gt;
&lt;li&gt;Infrastructure templates&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Platform-as-a-Product
&lt;/h2&gt;

&lt;p&gt;This concept is critical.&lt;/p&gt;

&lt;p&gt;A healthy platform team provides&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Self-service capabilities
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not manual intervention.&lt;/p&gt;

&lt;p&gt;Good platform&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer clicks button → environment created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad platform&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer opens Jira ticket → waits 2 weeks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The 3 Interaction Modes
&lt;/h2&gt;

&lt;p&gt;The framework also defines three interaction patterns between teams.&lt;/p&gt;

&lt;p&gt;These interaction modes are enormously important operationally.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Collaboration Mode
&lt;/h2&gt;

&lt;p&gt;Temporary close cooperation between teams.&lt;/p&gt;

&lt;p&gt;Used for:&lt;/p&gt;

&lt;p&gt;New capability adoption&lt;br&gt;
Complex integrations&lt;br&gt;
Discovery work&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Payments Team ↔ Platform Team
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Working together to implement service mesh adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Word: Temporary
&lt;/h2&gt;

&lt;p&gt;Permanent collaboration indicates unclear boundaries.&lt;/p&gt;

&lt;p&gt;Collaboration mode should end eventually.&lt;/p&gt;

&lt;p&gt;Otherwise dependency chains become permanent.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. X-as-a-Service Mode
&lt;/h2&gt;

&lt;p&gt;One team provides services consumed independently by others.&lt;/p&gt;

&lt;p&gt;This is the desired long-term state for platform teams.&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Platform Team → Kubernetes Platform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consumed self-service by product teams.&lt;/p&gt;

&lt;p&gt;Minimal synchronous interaction required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Signs Your Platform Interface Is Healthy
&lt;/h2&gt;

&lt;p&gt;Healthy X-as-a-Service characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Well documented&lt;/li&gt;
&lt;li&gt;Self-service&lt;/li&gt;
&lt;li&gt;Stable APIs&lt;/li&gt;
&lt;li&gt;Clear support boundaries&lt;/li&gt;
&lt;li&gt;Minimal tickets required&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Facilitating Mode
&lt;/h2&gt;

&lt;p&gt;Used by enabling teams.&lt;/p&gt;

&lt;p&gt;Purpose&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Teach capability
Not own capability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security workshops&lt;/li&gt;
&lt;li&gt;Incident response coaching&lt;/li&gt;
&lt;li&gt;Terraform migration guidance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Facilitating mode transfers knowledge intentionally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Assessing Your Current Topology: The 6 Key Questions
&lt;/h2&gt;

&lt;p&gt;Most organisations already feel their topology pain intuitively.&lt;/p&gt;

&lt;p&gt;This framework helps diagnose it systematically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question 1: How Many Teams Are Required for a Deployment?
&lt;/h2&gt;

&lt;p&gt;If the answer exceeds three consistently&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Flow efficiency is already degraded.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question 2: Are Platform Teams Productive or Ticket-Driven?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Platform teams buried in support queues are usually under-designed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question 3: Is Production Ownership Clear?
&lt;/h2&gt;

&lt;p&gt;During incidents&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Who owns this?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Should never require debate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question 4: How Much Cognitive Load Exists Per Team?
&lt;/h2&gt;

&lt;p&gt;Too many technologies, domains, or dependencies create delivery paralysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question 5: How Often Are Teams Waiting on Other Teams?
&lt;/h2&gt;

&lt;p&gt;Dependency-heavy organisations slow exponentially as headcount grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question 6: Are Teams Optimised Around Technology or Business Flow?
&lt;/h2&gt;

&lt;p&gt;Technology-aligned teams often create excessive handoffs.&lt;/p&gt;

&lt;p&gt;Business-stream alignment improves delivery velocity dramatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cognitive Load Assessment Framework
&lt;/h2&gt;

&lt;p&gt;Example survey structure&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;COGNITIVE_LOAD_SURVEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain_complexity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How well does the team understand the business domain?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_flag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt; 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;

    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;technology_breadth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many distinct technologies are maintained?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_flag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt; 5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;

    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dependency_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many teams are required per sprint?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_flag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt; 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kind of lightweight operational telemetry is surprisingly valuable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Most Common Team Topologies Anti-Patterns
&lt;/h2&gt;

&lt;p&gt;Most engineering organisations fail in recognisable ways.&lt;/p&gt;

&lt;p&gt;The same patterns appear repeatedly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-Pattern 1: The Shared Services Team Bottleneck
&lt;/h2&gt;

&lt;p&gt;Classic example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Shared DevOps Team
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CI/CD&lt;/li&gt;
&lt;li&gt;Kubernetes&lt;/li&gt;
&lt;li&gt;Terraform&lt;/li&gt;
&lt;li&gt;Monitoring&lt;/li&gt;
&lt;li&gt;Networking&lt;/li&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;Deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For every product team.&lt;/p&gt;

&lt;p&gt;Result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Centralised dependency bottleneck
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long ticket queues&lt;/li&gt;
&lt;li&gt;Slow onboarding&lt;/li&gt;
&lt;li&gt;Deployment delays&lt;/li&gt;
&lt;li&gt;Platform burnout&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Real Cost
&lt;/h2&gt;

&lt;p&gt;Shared services teams often become&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Organisational rate limiters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every engineering initiative slows behind them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better Model
&lt;/h2&gt;

&lt;p&gt;Replace shared services with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stream-aligned ownership&lt;/li&gt;
&lt;li&gt;Self-service platforms&lt;/li&gt;
&lt;li&gt;Enabling teams&lt;/li&gt;
&lt;li&gt;Platform-as-product&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Anti-Pattern 2: Platform Teams Without a Defined Interface
&lt;/h2&gt;

&lt;p&gt;Many platform teams say&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"We provide Kubernetes."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But what does that actually mean operationally?&lt;/p&gt;

&lt;p&gt;Healthy platforms define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;Golden paths&lt;/li&gt;
&lt;li&gt;Support models&lt;/li&gt;
&lt;li&gt;Service expectations&lt;/li&gt;
&lt;li&gt;Onboarding flows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without interfaces&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Platform becomes tribal knowledge.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Anti-Pattern 3: Enabling Teams That Never Stop Enabling
&lt;/h2&gt;

&lt;p&gt;Enabling teams should create independence.&lt;/p&gt;

&lt;p&gt;Not permanent dependency.&lt;/p&gt;

&lt;p&gt;Danger signs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teams require constant coaching forever&lt;/li&gt;
&lt;li&gt;Knowledge transfer never completes&lt;/li&gt;
&lt;li&gt;Enablement becomes embedded implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point the enabling team has failed structurally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-Pattern 4: Cognitive Load Mismatches
&lt;/h2&gt;

&lt;p&gt;This is one of the most damaging failure modes.&lt;/p&gt;

&lt;p&gt;Teams own too much simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple languages&lt;/li&gt;
&lt;li&gt;Multiple databases&lt;/li&gt;
&lt;li&gt;Infrastructure&lt;/li&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;CI/CD&lt;/li&gt;
&lt;li&gt;ML systems&lt;/li&gt;
&lt;li&gt;Distributed systems complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eventually&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Incident frequency rises
Delivery speed drops
Burnout accelerates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Measuring Cognitive Load
&lt;/h2&gt;

&lt;p&gt;Indicators include&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Warning Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Technologies maintained&lt;/td&gt;
&lt;td&gt;&amp;gt; 5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Teams depended on&lt;/td&gt;
&lt;td&gt;&amp;gt; 3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incident ambiguity&lt;/td&gt;
&lt;td&gt;Frequent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment complexity&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Documentation quality&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cognitive overload is usually visible before collapse occurs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Planning a Topology Change
&lt;/h2&gt;

&lt;p&gt;Topology redesign is organisational surgery.&lt;/p&gt;

&lt;p&gt;Done poorly, it creates chaos.&lt;/p&gt;

&lt;p&gt;Done carefully, it dramatically improves flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Identify Friction Points
&lt;/h2&gt;

&lt;p&gt;Start with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment delays&lt;/li&gt;
&lt;li&gt;Dependency bottlenecks&lt;/li&gt;
&lt;li&gt;Ticket queues&lt;/li&gt;
&lt;li&gt;Incident ownership confusion&lt;/li&gt;
&lt;li&gt;Platform dissatisfaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Map flow disruptions explicitly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Reduce Team Dependencies
&lt;/h2&gt;

&lt;p&gt;Optimise for&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Independent delivery capability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dependency reduction is usually the highest-ROI organisational improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Define Platform Interfaces
&lt;/h2&gt;

&lt;p&gt;Every platform capability should answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who uses this?&lt;/li&gt;
&lt;li&gt;How is it consumed?&lt;/li&gt;
&lt;li&gt;Is it self-service?&lt;/li&gt;
&lt;li&gt;What are support expectations?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 4: Transition Gradually
&lt;/h2&gt;

&lt;p&gt;Never reorganise everything simultaneously.&lt;/p&gt;

&lt;p&gt;Recommended approach&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pilot topology
↓
Measure outcomes
↓
Expand incrementally
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Organisational stability matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring the Impact
&lt;/h2&gt;

&lt;p&gt;Topology changes should produce measurable improvements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delivery Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deployment frequency&lt;/td&gt;
&lt;td&gt;Measures flow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lead time&lt;/td&gt;
&lt;td&gt;Measures delivery friction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MTTR&lt;/td&gt;
&lt;td&gt;Measures operational clarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change failure rate&lt;/td&gt;
&lt;td&gt;Measures stability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These align closely with DORA metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cognitive Load Surveys
&lt;/h2&gt;

&lt;p&gt;Run quarterly.&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;red_flags&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Urgent restructuring required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even lightweight surveys reveal structural problems surprisingly well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Platform Satisfaction Scores
&lt;/h2&gt;

&lt;p&gt;Ask stream-aligned teams&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;How frictionless is the platform?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single question often exposes platform dysfunction rapidly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example Topology Transformation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developers
↓
Shared DevOps Team
↓
Infrastructure Team
↓
Security Team
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Heavy coordination overhead.&lt;/p&gt;

&lt;p&gt;Slow deployments.&lt;/p&gt;

&lt;p&gt;Unclear ownership.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stream-Aligned Teams
        ↓
Self-Service Platform
        ↓
Enabling Teams
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Much faster flow.&lt;/p&gt;

&lt;p&gt;Reduced dependencies.&lt;/p&gt;

&lt;p&gt;Improved operational autonomy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes During Team Topologies Adoption
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Mistake 1: Renaming Teams Without Changing Responsibilities
&lt;/h2&gt;

&lt;p&gt;Changing titles changes nothing operationally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 2: Treating Platform Teams as Infrastructure Operations
&lt;/h2&gt;

&lt;p&gt;Platform teams should optimise developer experience.&lt;/p&gt;

&lt;p&gt;Not merely manage Kubernetes clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 3: Ignoring Cognitive Load
&lt;/h2&gt;

&lt;p&gt;More ownership is not always better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 4: Measuring Utilisation Instead of Flow
&lt;/h2&gt;

&lt;p&gt;Highly utilised teams often create slower organisations overall.&lt;/p&gt;

&lt;p&gt;Flow efficiency matters more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Organisational Architecture
&lt;/h2&gt;

&lt;p&gt;Healthy modern engineering organisations increasingly resemble&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stream-Aligned Teams
        ↓
Platform-as-a-Service
        ↓
Enabling Teams
        ↓
Specialist Subsystem Teams
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This structure scales operationally far better than traditional siloed models.&lt;/p&gt;

&lt;p&gt;Team Topologies matters because software delivery problems are rarely just technical.&lt;/p&gt;

&lt;p&gt;They are organisational.&lt;/p&gt;

&lt;p&gt;The framework gives engineering leaders a practical vocabulary for understanding why certain DevOps transformations stall despite heavy investment in tooling and automation.&lt;/p&gt;

&lt;p&gt;The most successful organisations consistently optimise for.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fast flow
Low cognitive load
Clear ownership
Self-service platforms
Minimal dependencies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And those outcomes emerge not from organisational theory alone, but from deliberate topology design.&lt;/p&gt;

&lt;p&gt;Because ultimately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The architecture of your systems
reflects the architecture of your teams.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Always&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>software</category>
      <category>topologies</category>
    </item>
    <item>
      <title>Secrets Management in Modern DevOps: Vault, IRSA, External Secrets When to Use Each</title>
      <dc:creator>varun varde</dc:creator>
      <pubDate>Fri, 15 May 2026 06:11:00 +0000</pubDate>
      <link>https://dev.to/varunvarde/secrets-management-in-modern-devops-vault-irsa-external-secrets-when-to-use-each-352m</link>
      <guid>https://dev.to/varunvarde/secrets-management-in-modern-devops-vault-irsa-external-secrets-when-to-use-each-352m</guid>
      <description>&lt;p&gt;Secrets management failures rarely begin with malicious intent.&lt;/p&gt;

&lt;p&gt;They begin with expediency.&lt;/p&gt;

&lt;p&gt;An engineer hardcodes an API key “temporarily.” A .env file gets committed accidentally. A production database password gets shared in Slack during an outage because “we’ll rotate it later.” Eventually those shortcuts accumulate into a sprawling credential catastrophe hidden beneath otherwise competent infrastructure.&lt;/p&gt;

&lt;p&gt;The uncomfortable truth is that poor secrets hygiene exists everywhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Startups&lt;/li&gt;
&lt;li&gt;Scaleups&lt;/li&gt;
&lt;li&gt;Enterprises&lt;/li&gt;
&lt;li&gt;Banks&lt;/li&gt;
&lt;li&gt;Government systems&lt;/li&gt;
&lt;li&gt;Fortune 500 infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The issue is rarely ignorance. It is architectural ambiguity.&lt;/p&gt;

&lt;p&gt;Modern DevOps teams now face multiple competing approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud-native identity systems&lt;/li&gt;
&lt;li&gt;Kubernetes secret abstractions&lt;/li&gt;
&lt;li&gt;Vault&lt;/li&gt;
&lt;li&gt;External Secrets Operator&lt;/li&gt;
&lt;li&gt;Sealed Secrets&lt;/li&gt;
&lt;li&gt;Workload identity federation&lt;/li&gt;
&lt;li&gt;Dynamic credentials&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choosing incorrectly creates operational fragility. Choosing well dramatically improves both security and developer experience.&lt;/p&gt;

&lt;p&gt;This guide explains when to use each model, where each one fails, and how to evolve from common anti-patterns toward a production-grade secrets architecture without detonating existing workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Secrets Management Anti-Patterns (and Their Blast Radius)
&lt;/h2&gt;

&lt;p&gt;Before discussing solutions, understand the failure modes.&lt;/p&gt;

&lt;p&gt;Because nearly every modern secrets architecture exists to solve one of these disasters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-Pattern 1: Hardcoded Secrets in Source Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-prod-293847239847&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not merely bad practice.&lt;/p&gt;

&lt;p&gt;It is operationally radioactive.&lt;/p&gt;

&lt;p&gt;Once committed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git history preserves it&lt;/li&gt;
&lt;li&gt;Forks replicate it&lt;/li&gt;
&lt;li&gt;CI logs may expose it&lt;/li&gt;
&lt;li&gt;Developers clone it locally&lt;/li&gt;
&lt;li&gt;Backups persist it indefinitely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even if deleted later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-Pattern 2: Shared Credentials&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prod-admin / password123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Used by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers&lt;/li&gt;
&lt;li&gt;CI systems&lt;/li&gt;
&lt;li&gt;Automation tools&lt;/li&gt;
&lt;li&gt;Contractors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No attribution
No least privilege
No revocation granularity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shared credentials eliminate accountability entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-Pattern 3: Long-Lived Cloud Access Keys&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stored inside:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jenkins&lt;/li&gt;
&lt;li&gt;GitHub Actions&lt;/li&gt;
&lt;li&gt;Kubernetes Secrets&lt;/li&gt;
&lt;li&gt;Terraform variables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Static credentials eventually leak.&lt;/p&gt;

&lt;p&gt;The question is timing, not probability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-Pattern 4: Kubernetes Secrets Misunderstood as Encryption&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Base64 encoding is not encryption.&lt;/p&gt;

&lt;p&gt;This surprises people alarmingly often.&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"cGFzc3dvcmQ="&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outputs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubernetes Secrets require additional controls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encryption at rest&lt;/li&gt;
&lt;li&gt;RBAC&lt;/li&gt;
&lt;li&gt;Admission policies&lt;/li&gt;
&lt;li&gt;Audit logging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Otherwise they become plaintext credential storage with better branding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the Modern Secrets Management Stack&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modern secrets management generally falls into four categories&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F637vw56z5kk9qrywpb9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F637vw56z5kk9qrywpb9t.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each solves different problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IRSA / Workload Identity: Cloud-Native Secretless Authentication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the most important architectural shift in modern cloud security&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stop distributing credentials.
Start distributing identity.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of giving workloads access keys&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pod → authenticated identity → temporary credentials
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No static secrets required.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS IRSA (IAM Roles for Service Accounts)
&lt;/h2&gt;

&lt;p&gt;Pods authenticate using Kubernetes service accounts mapped to IAM roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terraform IRSA Role&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"payment_service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"payment-service-role"&lt;/span&gt;

  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;

    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;

      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Federated&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_openid_connect_provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;

      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;StringEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nx"&gt;aws_eks_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;identity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;oidc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;issuer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;"https://"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;""&lt;/span&gt;
          &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:sub"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
          &lt;span class="s2"&gt;"system:serviceaccount:payments:payment-service"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Kubernetes Service Account&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;

&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-service&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;

  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::ACCOUNT:role/payment-service-role&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pods automatically receive temporary credentials.&lt;/p&gt;

&lt;p&gt;No secrets required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why IRSA Is Excellent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No static AWS keys&lt;/li&gt;
&lt;li&gt;Automatic credential rotation&lt;/li&gt;
&lt;li&gt;IAM-native permissions&lt;/li&gt;
&lt;li&gt;Short-lived credentials&lt;/li&gt;
&lt;li&gt;Excellent auditability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This should be the default model for AWS-native workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  GCP Workload Identity Equivalent
&lt;/h2&gt;

&lt;p&gt;GCP uses&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kubernetes Service Account
↔
Google Service Account
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Equivalent concept. Different implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure Workload Identity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Azure now supports federated workload identity similarly.&lt;/p&gt;

&lt;p&gt;The industry is converging on identity federation rather than credential distribution.&lt;/p&gt;

&lt;p&gt;This is good.&lt;/p&gt;

&lt;p&gt;When IRSA / Workload Identity Is NOT Enough&lt;/p&gt;

&lt;p&gt;Cloud-native identity works beautifully for cloud APIs.&lt;/p&gt;

&lt;p&gt;It becomes weaker when dealing with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Databases&lt;/li&gt;
&lt;li&gt;Third-party APIs&lt;/li&gt;
&lt;li&gt;Cross-cloud systems&lt;/li&gt;
&lt;li&gt;Legacy applications&lt;/li&gt;
&lt;li&gt;Dynamic credential issuance&lt;/li&gt;
&lt;li&gt;Multi-cluster secret orchestration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where Vault becomes valuable.&lt;/p&gt;

&lt;h2&gt;
  
  
  HashiCorp Vault: When You Need More Than Cloud-Native
&lt;/h2&gt;

&lt;p&gt;Vault solves problems identity federation alone cannot.&lt;/p&gt;

&lt;p&gt;Especially dynamic secrets.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Vault Capability
&lt;/h2&gt;

&lt;p&gt;Vault does not merely store secrets.&lt;/p&gt;

&lt;p&gt;It generates them dynamically.&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application requests PostgreSQL credentials
↓
Vault creates short-lived DB user
↓
Credentials expire automatically
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Massive security improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vault Kubernetes Authentication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault auth &lt;span class="nb"&gt;enable &lt;/span&gt;kubernetes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Vault Role Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault write auth/kubernetes/role/payment-api &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;bound_service_account_names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;payment-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;bound_service_account_namespaces&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;payments &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;policies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;payment-read &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pods authenticate automatically via Kubernetes identity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Database Credentials&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault &lt;span class="nb"&gt;read &lt;/span&gt;database/creds/payment-role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v-token-abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"generated-secret"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lease_duration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Credentials expire automatically after one hour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Vault Is the Right Choice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use Vault when you need&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Vault&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dynamic secrets&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-cloud support&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-grained audit logs&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PKI management&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database credential rotation&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secret leasing&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Vault Tradeoffs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vault is operationally heavier.&lt;/p&gt;

&lt;p&gt;You now manage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HA clustering&lt;/li&gt;
&lt;li&gt;Storage backend&lt;/li&gt;
&lt;li&gt;Unseal process&lt;/li&gt;
&lt;li&gt;Disaster recovery&lt;/li&gt;
&lt;li&gt;Performance replication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vault is powerful because it solves hard problems.&lt;/p&gt;

&lt;p&gt;Hard problems come with operational complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  External Secrets Operator: The Kubernetes-Native Abstraction Layer
&lt;/h2&gt;

&lt;p&gt;External Secrets Operator (ESO) is one of the cleanest Kubernetes-native abstractions available today.&lt;/p&gt;

&lt;p&gt;Instead of storing secrets directly in Kubernetes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kubernetes Secret
← synced from →
Vault / AWS Secrets Manager / GCP Secret Manager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Installing ESO&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;external-secrets external-secrets/external-secrets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  AWS Secrets Manager Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ExternalSecret&lt;/span&gt;

&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-api-secret&lt;/span&gt;

&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;refreshInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;

  &lt;span class="na"&gt;secretStoreRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-secret-store&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretStore&lt;/span&gt;

  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-api-secret&lt;/span&gt;

  &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-key&lt;/span&gt;
    &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod/payment-api&lt;/span&gt;
      &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api_key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why ESO Is Excellent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes-native&lt;/li&gt;
&lt;li&gt;GitOps-friendly&lt;/li&gt;
&lt;li&gt;Central secret backend&lt;/li&gt;
&lt;li&gt;Automatic refresh&lt;/li&gt;
&lt;li&gt;Cleaner operational model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ESO is often the best abstraction for Kubernetes workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  When ESO Is NOT Enough
&lt;/h2&gt;

&lt;p&gt;ESO synchronises secrets.&lt;/p&gt;

&lt;p&gt;It does not generate dynamic credentials.&lt;/p&gt;

&lt;p&gt;If you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic DB users&lt;/li&gt;
&lt;li&gt;Certificate issuance&lt;/li&gt;
&lt;li&gt;Secret leasing&lt;/li&gt;
&lt;li&gt;PKI workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You still need Vault or equivalent systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sealed Secrets: Simple Offline Encryption for GitOps
&lt;/h2&gt;

&lt;p&gt;Sealed Secrets solve a specific problem elegantly&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;How do you store encrypted secrets safely in Git?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sealed Secret Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Developer creates&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret generic app-secret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Encrypts&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubeseal &lt;span class="nt"&gt;--format&lt;/span&gt; yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bitnami.com/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SealedSecret&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only the cluster controller can decrypt it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Teams Love Sealed Secrets
&lt;/h2&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple&lt;/li&gt;
&lt;li&gt;GitOps-compatible&lt;/li&gt;
&lt;li&gt;Easy onboarding&lt;/li&gt;
&lt;li&gt;No external dependency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where Sealed Secrets Fall Short&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Static secrets only&lt;/li&gt;
&lt;li&gt;No automatic rotation&lt;/li&gt;
&lt;li&gt;Kubernetes-scoped&lt;/li&gt;
&lt;li&gt;No dynamic credential issuance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Excellent for smaller GitOps environments.&lt;/p&gt;

&lt;p&gt;Less ideal for enterprise-scale secret orchestration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secrets Rotation: The Missing Piece Most Implementations Skip
&lt;/h2&gt;

&lt;p&gt;This is the most neglected part of secrets management.&lt;/p&gt;

&lt;p&gt;Teams store secrets securely but never rotate them.&lt;/p&gt;

&lt;p&gt;Which defeats half the purpose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rotation Targets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rotate regularly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Secret Type&lt;/th&gt;
&lt;th&gt;Rotation Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API keys&lt;/td&gt;
&lt;td&gt;30–90 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB credentials&lt;/td&gt;
&lt;td&gt;Dynamic preferred&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLS certificates&lt;/td&gt;
&lt;td&gt;30–90 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI tokens&lt;/td&gt;
&lt;td&gt;30 days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Vault Dynamic Rotation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Best model&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generate → use → expire automatically
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No manual rotation required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Secrets Manager Rotation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example Lambda rotation&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;RotationRules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;AutomaticallyAfterDays&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Common Rotation Failure Mode&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Applications caching credentials indefinitely.&lt;/p&gt;

&lt;p&gt;Result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Secret rotated
↓
Application breaks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applications must reload credentials gracefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  Audit Logging: Knowing Who Accessed What and When
&lt;/h2&gt;

&lt;p&gt;Secrets access without auditing is operational blindness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vault Audit Logging&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enable&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault audit &lt;span class="nb"&gt;enable &lt;/span&gt;file &lt;span class="nv"&gt;file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/log/vault_audit.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every secret request becomes traceable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS CloudTrail&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;IRSA requests appear in CloudTrail automatically.&lt;/p&gt;

&lt;p&gt;This is one reason identity federation is so operationally attractive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical Audit Questions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You should always answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who accessed this secret?&lt;/li&gt;
&lt;li&gt;When?&lt;/li&gt;
&lt;li&gt;From which workload?&lt;/li&gt;
&lt;li&gt;Was it expected?&lt;/li&gt;
&lt;li&gt;Was it anomalous?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without auditability, incident response becomes guesswork.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migration Playbook: Moving from Hard-Coded to Vault in 4 Weeks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most organisations cannot migrate instantly.&lt;/p&gt;

&lt;p&gt;They need staged evolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1: Discovery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Identify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;.env files&lt;/li&gt;
&lt;li&gt;Hardcoded credentials&lt;/li&gt;
&lt;li&gt;CI secrets&lt;/li&gt;
&lt;li&gt;Kubernetes Secrets&lt;/li&gt;
&lt;li&gt;Shared accounts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 2: Centralisation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Move secrets into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vault&lt;/li&gt;
&lt;li&gt;AWS Secrets Manager&lt;/li&gt;
&lt;li&gt;GCP Secret Manager&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without changing applications yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 3: Kubernetes Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ESO&lt;/li&gt;
&lt;li&gt;Vault Agent Injector&lt;/li&gt;
&lt;li&gt;IRSA&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start consuming secrets dynamically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 4: Rotation and Cleanup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rotate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old credentials&lt;/li&gt;
&lt;li&gt;Shared passwords&lt;/li&gt;
&lt;li&gt;Long-lived tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then delete legacy storage completely.&lt;/p&gt;

&lt;p&gt;Not “later.”&lt;/p&gt;

&lt;p&gt;Immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Cloud Secrets: Managing Credentials Across AWS, Azure, and GCP
&lt;/h2&gt;

&lt;p&gt;Multi-cloud secrets management becomes operationally difficult quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Recommended Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS-only&lt;/td&gt;
&lt;td&gt;IRSA + Secrets Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCP-only&lt;/td&gt;
&lt;td&gt;Workload Identity + Secret Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure-only&lt;/td&gt;
&lt;td&gt;Managed Identity + Key Vault&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-cloud&lt;/td&gt;
&lt;td&gt;Vault&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Vault becomes particularly valuable when standardising identity across clouds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended Enterprise Architecture&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kubernetes Workload
        ↓
IRSA / Workload Identity
        ↓
Vault / Cloud Secret Manager
        ↓
External Secrets Operator
        ↓
Application Runtime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Layered abstractions create operational flexibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Secrets Management Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Treating Kubernetes Secrets as Secure by Default&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Never Rotating Credentials&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Static secrets become permanent liabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Using Shared Accounts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Breaks attribution entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Giving Vault Excessive Permissions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vault should broker secrets.&lt;/p&gt;

&lt;p&gt;Not become root over everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Ignoring Audit Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Visibility matters as much as encryption.&lt;/p&gt;

&lt;p&gt;Modern secrets management is no longer about hiding passwords.&lt;/p&gt;

&lt;p&gt;It is about distributing trust safely.&lt;/p&gt;

&lt;p&gt;The strongest DevOps environments increasingly follow several principles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Identity over credentials
Temporary over permanent
Dynamic over static
Automated over manual
Auditable over opaque
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;IRSA and workload identity eliminate entire classes of cloud credential risk.&lt;/p&gt;

&lt;p&gt;Vault enables dynamic, short-lived infrastructure authentication.&lt;/p&gt;

&lt;p&gt;External Secrets Operator creates elegant Kubernetes-native integration.&lt;/p&gt;

&lt;p&gt;Sealed Secrets simplify GitOps encryption.&lt;/p&gt;

&lt;p&gt;Each tool has a legitimate role.&lt;/p&gt;

&lt;p&gt;The mistake is not choosing the wrong product.&lt;/p&gt;

&lt;p&gt;The mistake is assuming one tool solves every secrets problem equally well.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>programming</category>
      <category>irsa</category>
      <category>linux</category>
    </item>
    <item>
      <title>DevSecOps Pipeline in a Day: Automated Security from Commit to Deploy</title>
      <dc:creator>varun varde</dc:creator>
      <pubDate>Tue, 12 May 2026 05:23:00 +0000</pubDate>
      <link>https://dev.to/varunvarde/devsecops-pipeline-in-a-day-automated-security-from-commit-to-deploy-1c6h</link>
      <guid>https://dev.to/varunvarde/devsecops-pipeline-in-a-day-automated-security-from-commit-to-deploy-1c6h</guid>
      <description>&lt;p&gt;Security that happens after deployment is already too late.&lt;/p&gt;

&lt;p&gt;By the time a quarterly penetration test discovers hardcoded secrets, vulnerable containers, or publicly exposed infrastructure, the vulnerable code has usually been in production for months. Sometimes years. The remediation backlog grows. Developers lose context. Security becomes bureaucratic archaeology rather than operational engineering.&lt;/p&gt;

&lt;p&gt;DevSecOps changes the timing.&lt;/p&gt;

&lt;p&gt;Instead of treating security as a gate at the end of delivery, it embeds security checks throughout the software lifecycle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Commit → Build → Test → Scan → Deploy → Monitor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every stage becomes an opportunity to reduce risk automatically.&lt;/p&gt;

&lt;p&gt;This tutorial builds a complete open-source DevSecOps pipeline in a single day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Secret detection before commits&lt;/li&gt;
&lt;li&gt;SAST on every pull request&lt;/li&gt;
&lt;li&gt;Dependency vulnerability scanning&lt;/li&gt;
&lt;li&gt;Container image scanning&lt;/li&gt;
&lt;li&gt;Terraform and Kubernetes IaC scanning&lt;/li&gt;
&lt;li&gt;DAST against staging environments&lt;/li&gt;
&lt;li&gt;Centralised vulnerability reporting&lt;/li&gt;
&lt;li&gt;Security SLA policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No enterprise security platform required.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DevSecOps Security Layer Model Where Each Check Lives
&lt;/h2&gt;

&lt;p&gt;Security works best when distributed.&lt;/p&gt;

&lt;p&gt;Not centralised.&lt;/p&gt;

&lt;p&gt;Each security control belongs at the earliest operational layer where it can execute effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Six-Layer Model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1 → Developer workstation
Layer 2 → Pull request pipeline
Layer 3 → Dependency validation
Layer 4 → Container security
Layer 5 → Infrastructure-as-Code validation
Layer 6 → Runtime application testing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every layer catches different failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Layering Matters
&lt;/h2&gt;

&lt;p&gt;No single scanner catches everything.&lt;/p&gt;

&lt;p&gt;Example&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzlv5etk2eal8d4z6p8z6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzlv5etk2eal8d4z6p8z6.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Security becomes resilient through redundancy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Pre-Commit Hooks — detect-secrets and git-secrets Setup
&lt;/h2&gt;

&lt;p&gt;The cheapest vulnerability to fix is the one that never enters Git history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installing Pre-Commit Framework&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pre-commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;detect-secrets Configuration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;repos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/Yelp/detect-secrets&lt;/span&gt;
  &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1.4.0&lt;/span&gt;
  &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;detect-secrets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install hooks&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pre-commit &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;git-secrets for AWS Credentials&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git secrets &lt;span class="nt"&gt;--install&lt;/span&gt;
git secrets &lt;span class="nt"&gt;--register-aws&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example Detection&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS_SECRET_ACCESS_KEY detected
Commit rejected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents catastrophic credential leakage before CI even starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Pre-Commit Security Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Secrets committed once often persist forever in Git history.&lt;/p&gt;

&lt;p&gt;Even after deletion.&lt;/p&gt;

&lt;p&gt;Prevention beats remediation.&lt;/p&gt;

&lt;p&gt;Always.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2: SAST in CI — Semgrep for Application Code
&lt;/h2&gt;

&lt;p&gt;Static Application Security Testing identifies insecure coding patterns before deployment.&lt;/p&gt;

&lt;p&gt;Semgrep is exceptionally effective because it balances signal quality with developer usability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Actions SAST Workflow&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;sast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Semgrep SAST&lt;/span&gt;
    &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;returntocorp/semgrep-action@v1&lt;/span&gt;
    &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p/owasp-top-ten&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;p/python&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;p/javascript"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example Vulnerability Detection&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE id = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Semgrep flags&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Possible SQL injection vulnerability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Custom Security Rules&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Production environments eventually require organisation-specific rules.&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;no-public-s3&lt;/span&gt;
  &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;"public-read"'&lt;/span&gt;
  &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Public S3 ACL forbidden&lt;/span&gt;
  &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ERROR&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why SAST Must Run on Every PR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security reviews delayed until release branches create vulnerability bottlenecks.&lt;/p&gt;

&lt;p&gt;Fast feedback changes behaviour.&lt;/p&gt;

&lt;p&gt;Delayed feedback creates resentment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 3: SCA — OWASP Dependency-Check and Trivy for Dependencies
&lt;/h2&gt;

&lt;p&gt;Modern applications inherit more code than they write.&lt;/p&gt;

&lt;p&gt;Dependency vulnerabilities therefore matter enormously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OWASP Dependency-Check&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dependency-check.sh &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--project&lt;/span&gt; app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scan&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trivy Dependency Scan&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;trivy fs &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Critical vulnerability:
log4j-core 2.14.1
CVE-2021-44228
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5a39xvsx2zaid2flmaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5a39xvsx2zaid2flmaz.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency Update Automation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use Renovate or Dependabot&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;updates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;package-ecosystem&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;daily&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Automation reduces vulnerability half-life dramatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 4: Container Image Scanning — Trivy in Your Docker Build Pipeline
&lt;/h2&gt;

&lt;p&gt;Containers frequently contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vulnerable OS packages&lt;/li&gt;
&lt;li&gt;Unpatched libraries&lt;/li&gt;
&lt;li&gt;Misconfigurations&lt;/li&gt;
&lt;li&gt;Embedded secrets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scanning them is mandatory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build and Scan Workflow&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;container-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build image&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t app:${{ github.sha }} .&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trivy vulnerability scan&lt;/span&gt;
    &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
    &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app:${{ github.sha }}&lt;/span&gt;
      &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL,HIGH'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example Container Findings&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openssl package vulnerable
Severity: HIGH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Distroless Images Reduce Attack Surface&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; ubuntu:22.04&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; gcr.io/distroless/static&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Smaller images. Fewer packages. Fewer CVEs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 5: IaC Security Scanning — Checkov on Every Terraform Plan
&lt;/h2&gt;

&lt;p&gt;Infrastructure misconfigurations cause some of the most damaging cloud breaches.&lt;/p&gt;

&lt;p&gt;IaC scanning catches them before deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkov GitHub Action&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;iac-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkov IaC scan&lt;/span&gt;
    &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridgecrewio/checkov-action@master&lt;/span&gt;
    &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform/&lt;/span&gt;
      &lt;span class="na"&gt;framework&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example Terraform Misconfiguration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"bad"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checkov flags&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Security group allows unrestricted ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommended IaC Policies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Block:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public S3 buckets&lt;/li&gt;
&lt;li&gt;Open security groups&lt;/li&gt;
&lt;li&gt;Unencrypted databases&lt;/li&gt;
&lt;li&gt;Unencrypted EBS volumes&lt;/li&gt;
&lt;li&gt;Wildcard IAM policies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Layer 6: DAST — OWASP ZAP Against Your Staging Environment
&lt;/h2&gt;

&lt;p&gt;DAST validates runtime behaviour.&lt;/p&gt;

&lt;p&gt;Unlike SAST, it tests deployed applications directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OWASP ZAP Docker Scan&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-t&lt;/span&gt; owasp/zap2docker-stable &lt;span class="se"&gt;\&lt;/span&gt;
  zap-baseline.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; https://staging.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CI Integration Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ZAP Scan&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;docker run -t owasp/zap2docker-stable \&lt;/span&gt;
      &lt;span class="s"&gt;zap-baseline.py \&lt;/span&gt;
      &lt;span class="s"&gt;-t https://staging.example.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Vulnerabilities DAST Finds Well&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XSS&lt;/li&gt;
&lt;li&gt;Missing headers&lt;/li&gt;
&lt;li&gt;Insecure cookies&lt;/li&gt;
&lt;li&gt;Open redirects&lt;/li&gt;
&lt;li&gt;Authentication weaknesses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why DAST Complements SAST&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SAST sees code.&lt;/p&gt;

&lt;p&gt;DAST sees behaviour.&lt;/p&gt;

&lt;p&gt;You need both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Centralising Findings in Defect Dojo&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without centralisation, findings scatter across tools and become operational noise.&lt;/p&gt;

&lt;p&gt;Defect Dojo consolidates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SAST results&lt;/li&gt;
&lt;li&gt;Dependency scans&lt;/li&gt;
&lt;li&gt;Container findings&lt;/li&gt;
&lt;li&gt;DAST reports
&lt;strong&gt;Defect Dojo Deployment&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;defectdojo defectdojo/defectdojo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Importing Scan Results&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://dojo/api/v2/import-scan/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Centralisation Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security programmes fail when visibility fragments.&lt;/p&gt;

&lt;p&gt;One dashboard changes operational behaviour.&lt;/p&gt;

&lt;h2&gt;
  
  
  SLA Policies: How to Treat CRITICAL vs HIGH vs MEDIUM Findings
&lt;/h2&gt;

&lt;p&gt;Not all vulnerabilities deserve identical urgency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended SLA Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwzg6g6wvrj87bu6dkd6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwzg6g6wvrj87bu6dkd6.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI Enforcement Strategy&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CRITICAL → Block merge
HIGH → Fail release
MEDIUM → Warn only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Security governance must remain operationally realistic.&lt;/p&gt;

&lt;p&gt;Overly aggressive policies create bypass behaviour.&lt;/p&gt;

&lt;p&gt;Measuring DevSecOps Effectiveness Mean Time to Remediation&lt;/p&gt;

&lt;p&gt;Security programmes require measurable outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Metrics&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Mean Time to Remediation (MTTR)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Discovery → Remediation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shorter is better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vulnerability Escape Rate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How many vulnerabilities reach production?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False Positive Rate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If scanners create excessive noise&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developers stop trusting alerts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Signal quality matters enormously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coverage Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repositories scanned&lt;/li&gt;
&lt;li&gt;Terraform coverage&lt;/li&gt;
&lt;li&gt;Container coverage&lt;/li&gt;
&lt;li&gt;Dependency scan adoption&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Full GitHub Actions DevSecOps Workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DevSecOps Pipeline&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

  &lt;span class="na"&gt;secrets-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TruffleHog&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trufflesecurity/trufflehog@main&lt;/span&gt;

  &lt;span class="na"&gt;sast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Semgrep&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;returntocorp/semgrep-action@v1&lt;/span&gt;

  &lt;span class="na"&gt;dependency-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trivy FS Scan&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trivy fs .&lt;/span&gt;

  &lt;span class="na"&gt;container-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t app:${{ github.sha }} .&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trivy Image Scan&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trivy image app:${{ github.sha }}&lt;/span&gt;

  &lt;span class="na"&gt;iac-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkov&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridgecrewio/checkov-action@master&lt;/span&gt;

  &lt;span class="na"&gt;dast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OWASP ZAP&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;docker run -t owasp/zap2docker-stable \&lt;/span&gt;
        &lt;span class="s"&gt;zap-baseline.py \&lt;/span&gt;
        &lt;span class="s"&gt;-t https://staging.example.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common DevSecOps Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Blocking Everything Immediately&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Teams bypass pipelines if friction becomes unbearable.&lt;/p&gt;

&lt;p&gt;Adopt incrementally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Ignoring False Positives&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Poor signal quality destroys developer trust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Treating Security as Separate from Engineering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security tooling must integrate into existing workflows.&lt;/p&gt;

&lt;p&gt;Not create parallel ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. No Ownership Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Findings without owners become backlog sediment.&lt;/p&gt;

&lt;p&gt;DevSecOps is not about inserting security gates into delivery pipelines.&lt;/p&gt;

&lt;p&gt;It is about making security part of normal engineering behaviour.&lt;/p&gt;

&lt;p&gt;The most successful DevSecOps environments share several characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast feedback&lt;/li&gt;
&lt;li&gt;Automated enforcement&lt;/li&gt;
&lt;li&gt;Low-friction tooling&lt;/li&gt;
&lt;li&gt;Developer-visible results&lt;/li&gt;
&lt;li&gt;Incremental adoption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Security stops being ceremonial compliance theatre and becomes operational engineering.&lt;/p&gt;

&lt;p&gt;And that is the critical shift.&lt;/p&gt;

&lt;p&gt;Because modern software delivery moves too quickly for security reviews performed weeks after deployment.&lt;/p&gt;

&lt;p&gt;The only scalable model is continuous security at continuous delivery speed.&lt;/p&gt;

</description>
      <category>devsecops</category>
      <category>devops</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>FinOps for DevOps Engineers: The Complete Cloud Cost Optimisation Playbook</title>
      <dc:creator>varun varde</dc:creator>
      <pubDate>Fri, 08 May 2026 12:03:16 +0000</pubDate>
      <link>https://dev.to/varunvarde/finops-for-devops-engineers-the-complete-cloud-cost-optimisation-playbook-2ep9</link>
      <guid>https://dev.to/varunvarde/finops-for-devops-engineers-the-complete-cloud-cost-optimisation-playbook-2ep9</guid>
      <description>&lt;p&gt;Cloud bills rarely explode because of one catastrophic decision. They grow incrementally. Quietly. A forgotten load balancer here. Overprovisioned Kubernetes nodes there. NAT Gateway traffic multiplying invisibly in the background like fiscal mold behind drywall.&lt;/p&gt;

&lt;p&gt;Most organisations approach FinOps as a finance exercise. That is a strategic mistake.&lt;/p&gt;

&lt;p&gt;The engineers provisioning infrastructure are the same engineers best positioned to optimise it. DevOps teams control autoscaling, storage policies, networking topology, observability retention, and workload scheduling. They are not adjacent to cloud cost optimisation. They are the operational epicentre of it.&lt;/p&gt;

&lt;p&gt;This playbook focuses on practical FinOps implementation for DevOps and platform engineers. Not abstract governance theory. Actual engineering patterns that reduce spend without degrading reliability.&lt;/p&gt;

&lt;p&gt;The optimisation path is organised by return on investment. Start with visibility. Then tackle compute, storage, networking, Kubernetes, and finally governance automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: Visibility First — Tagging Standards and Cost Attribution
&lt;/h2&gt;

&lt;p&gt;You cannot optimise what you cannot attribute.&lt;/p&gt;

&lt;p&gt;Most cloud environments fail at cost management because nobody knows which team owns what.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Minimum Viable Tagging Standard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every resource should contain&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;team&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"platform-engineering"&lt;/span&gt;
  &lt;span class="nx"&gt;environment&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
  &lt;span class="nx"&gt;application&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"checkout-api"&lt;/span&gt;
  &lt;span class="nx"&gt;cost-centre&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ENG-042"&lt;/span&gt;
  &lt;span class="nx"&gt;owner&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"payments-team"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Tags Matter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without tags&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cloud bill = giant undifferentiated blob
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With tags&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cloud bill = attributable operational data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This changes engineering behaviour immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Cost Allocation Tags
&lt;/h2&gt;

&lt;p&gt;Enable them explicitly&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ce list-cost-allocation-tags
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then activate&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Billing Console → Cost Allocation Tags → Activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cost Dashboard Strategy
&lt;/h2&gt;

&lt;p&gt;Build dashboards around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cost by team&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost by environment&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost by service&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Week-over-week growth&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Top anomalous resources&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3s2qml6nzonzlvrmizg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3s2qml6nzonzlvrmizg.png" alt=" " width="607" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2: Compute Optimisation — Rightsizing, Spot, Graviton
&lt;/h2&gt;

&lt;p&gt;Compute is usually the largest controllable expense category.&lt;/p&gt;

&lt;p&gt;And most environments are dramatically oversized.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rightsizing EC2 Instances
&lt;/h2&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;m5.4xlarge
Average CPU: 9%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not infrastructure. It is financial leakage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identify Idle Instances
&lt;/h2&gt;

&lt;p&gt;Using CloudWatch&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fphhuu32fnijfwkffgkc1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fphhuu32fnijfwkffgkc1.png" alt=" " width="642" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Spot Instances
&lt;/h2&gt;

&lt;p&gt;Spot pricing can reduce costs by 70–90%.&lt;/p&gt;

&lt;p&gt;Perfect for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CI runners&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Batch jobs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Non-critical workloads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubernetes worker nodes&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Terraform Spot Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"spot_worker"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"m7g.large"&lt;/span&gt;

  &lt;span class="nx"&gt;instance_market_options&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;market_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"spot"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  AWS Graviton Migration
&lt;/h2&gt;

&lt;p&gt;Graviton instances routinely reduce compute costs by 20–40%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migration Candidate Checklist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Best workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateless APIs&lt;/li&gt;
&lt;li&gt;Containers&lt;/li&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;Go&lt;/li&gt;
&lt;li&gt;Java 17+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes Node Group Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eksctl.io/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterConfig&lt;/span&gt;

&lt;span class="na"&gt;managedNodeGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;graviton-workers&lt;/span&gt;
    &lt;span class="na"&gt;instanceType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;m7g.large&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Part 3: Storage Optimisation — S3 Tiers, EBS, Lifecycle Policies
&lt;/h2&gt;

&lt;p&gt;Storage inefficiency compounds silently over years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 Lifecycle Policies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fastest storage win in AWS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terraform Lifecycle Policy&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3_bucket_lifecycle_configuration"&lt;/span&gt; &lt;span class="s2"&gt;"cost_optimised"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_s3_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"archive-old-data"&lt;/span&gt;
    &lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Enabled"&lt;/span&gt;

    &lt;span class="nx"&gt;transition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;days&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
      &lt;span class="nx"&gt;storage_class&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"INTELLIGENT_TIERING"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;transition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;days&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;
      &lt;span class="nx"&gt;storage_class&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GLACIER_IR"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;transition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;days&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;365&lt;/span&gt;
      &lt;span class="nx"&gt;storage_class&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DEEP_ARCHIVE"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EBS Optimisation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Common waste patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detached volumes&lt;/li&gt;
&lt;li&gt;Oversized gp3 disks&lt;/li&gt;
&lt;li&gt;Unused snapshots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Find Unattached Volumes&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 describe-volumes &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;status,Values&lt;span class="o"&gt;=&lt;/span&gt;available
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unless compliance requires otherwise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 4: Networking Cost Reduction — NAT Gateway, VPC Endpoints, Data Transfer
&lt;/h2&gt;

&lt;p&gt;Networking costs surprise almost everyone.&lt;/p&gt;

&lt;p&gt;Especially NAT Gateways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NAT Gateway Optimisation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;NAT Gateway charges include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hourly fee&lt;/li&gt;
&lt;li&gt;Per-GB transfer fee&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Large clusters can spend thousands monthly on NAT traffic alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replace NAT Traffic with VPC Endpoints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;service_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"com.amazonaws.us-east-1.s3"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This eliminates NAT transfer charges for S3 traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduce Cross-AZ Traffic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hidden cost source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Service A → AZ-1
Service B → AZ-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every request incurs transfer cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes Affinity Rules&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;topologySpreadConstraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;topology.kubernetes.io/zone&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep chatty services co-located.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 5: Database Cost Optimisation — RDS Rightsizing, Aurora Serverless, Read Replica Pruning
&lt;/h2&gt;

&lt;p&gt;Databases are expensive because teams fear touching them.&lt;/p&gt;

&lt;p&gt;Reasonably so.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Rightsizing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU&lt;/li&gt;
&lt;li&gt;Connections&lt;/li&gt;
&lt;li&gt;IOPS&lt;/li&gt;
&lt;li&gt;Memory pressure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Downsize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.r6g.4xlarge → db.r6g.xlarge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Often invisible to applications.&lt;/p&gt;

&lt;p&gt;Massively visible to finance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aurora Serverless v2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ideal for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Variable workloads&lt;/li&gt;
&lt;li&gt;Internal APIs&lt;/li&gt;
&lt;li&gt;Intermittent services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Terraform Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;serverlessv2_scaling_configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;min_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
  &lt;span class="nx"&gt;max_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Read Replica Cleanup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Common anti-pattern&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Temporary read replica
→ never removed
→ costs persist forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Audit quarterly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 6: Reserved Instances &amp;amp; Savings Plans — When to Buy and How Much
&lt;/h2&gt;

&lt;p&gt;Savings Plans are powerful when used correctly.&lt;/p&gt;

&lt;p&gt;Dangerous when guessed incorrectly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start conservative.&lt;/p&gt;

&lt;p&gt;Target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;60–70% baseline utilisation coverage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Never 100%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute Savings Plans&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Best default option.&lt;/p&gt;

&lt;p&gt;Flexible across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instance families&lt;/li&gt;
&lt;li&gt;Regions&lt;/li&gt;
&lt;li&gt;Compute types&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AWS Recommendation API&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ce get-savings-plans-purchase-recommendation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use actual usage history.&lt;/p&gt;

&lt;p&gt;Not optimism.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 7: Kubernetes Cost Optimisation — Bin Packing, Cluster Autoscaler, Spot Node Groups
&lt;/h2&gt;

&lt;p&gt;Kubernetes amplifies both efficiency and waste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bin Packing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Underutilised nodes are financial dead weight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource Requests Matter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actual usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;300m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Goldilocks Recommendation Tool&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;install &lt;/span&gt;goldilocks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Automatically suggests request sizing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster Autoscaler&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--balance-similar-node-groups=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Removes idle nodes dynamically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spot Node Groups&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;capacityType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SPOT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Excellent for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateless apps&lt;/li&gt;
&lt;li&gt;Batch workers&lt;/li&gt;
&lt;li&gt;CI runners&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Part 8: Monitoring Cost Creep — Alerting on Unexpected Spend Increases
&lt;/h2&gt;

&lt;p&gt;Cost optimisation without monitoring regresses rapidly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget Alerts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS Example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws budgets create-budget
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Prometheus Cost Alert&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;groups:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;name:&lt;/span&gt; &lt;span class="n"&gt;cloud_cost_alerts&lt;/span&gt;
  &lt;span class="n"&gt;rules:&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alert:&lt;/span&gt; &lt;span class="n"&gt;MonthlySpendSpike&lt;/span&gt;
    &lt;span class="n"&gt;expr:&lt;/span&gt; &lt;span class="nb"&gt;increase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cloud_cost_total&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;24h&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Slack Notification Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;webhook_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cloud spend increased unexpectedly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Immediate visibility changes behaviour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 9: The Monthly Cost Review Checklist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best FinOps teams operationalise review cadence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monthly Checklist&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Compute&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Idle instances removed&lt;/li&gt;
&lt;li&gt;Rightsizing opportunities reviewed&lt;/li&gt;
&lt;li&gt;Spot coverage audited&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snapshot retention reviewed&lt;/li&gt;
&lt;li&gt;Glacier transitions verified&lt;/li&gt;
&lt;li&gt;Orphaned volumes deleted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node utilisation checked&lt;/li&gt;
&lt;li&gt;Resource requests audited&lt;/li&gt;
&lt;li&gt;Cluster Autoscaler effectiveness reviewed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Networking&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NAT Gateway spend analysed&lt;/li&gt;
&lt;li&gt;Cross-region traffic reviewed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Databases&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read replicas validated&lt;/li&gt;
&lt;li&gt;Aurora scaling reviewed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Untagged resources identified&lt;/li&gt;
&lt;li&gt;Budget alerts tested&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Appendix: Azure and GCP Equivalents
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Compute&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xuavmky3ylbejdfjpjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xuavmky3ylbejdfjpjf.png" alt=" " width="770" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7z3xwsdggqf6cpzuq1m1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7z3xwsdggqf6cpzuq1m1.png" alt=" " width="733" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;FinOps is not about making infrastructure cheap.&lt;/p&gt;

&lt;p&gt;It is about making infrastructure intentional.&lt;/p&gt;

&lt;p&gt;The most effective DevOps teams treat cloud cost as an engineering metric alongside latency, reliability, and deployment frequency.&lt;/p&gt;

&lt;p&gt;Because every oversized node, forgotten snapshot, or unnecessary NAT transfer represents engineering inefficiency expressed financially.&lt;/p&gt;

&lt;p&gt;The progression usually looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Visibility → Attribution → Accountability → Optimisation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without visibility, optimisation is guesswork.&lt;/p&gt;

&lt;p&gt;Without attribution, accountability disappears.&lt;/p&gt;

&lt;p&gt;Without accountability, cloud spend becomes entropy.&lt;/p&gt;

&lt;p&gt;But when engineers own both infrastructure reliability and infrastructure economics, something powerful happens:&lt;/p&gt;

&lt;p&gt;Systems become leaner.&lt;br&gt;
Architectures become cleaner.&lt;br&gt;
And cloud bills stop being monthly surprises.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>playbook</category>
      <category>aws</category>
    </item>
    <item>
      <title>The FinOps Starter Kit: Making Cloud Cost Visible in 5 Days</title>
      <dc:creator>varun varde</dc:creator>
      <pubDate>Fri, 01 May 2026 08:29:13 +0000</pubDate>
      <link>https://dev.to/varunvarde/the-finops-starter-kit-making-cloud-cost-visible-in-5-days-4d0k</link>
      <guid>https://dev.to/varunvarde/the-finops-starter-kit-making-cloud-cost-visible-in-5-days-4d0k</guid>
      <description>&lt;p&gt;Most cloud cost advice starts at the wrong layer. It jumps straight into optimization tactics Reserved Instances, Spot capacity, aggressive rightsizing without first addressing the more fundamental problem: visibility.&lt;/p&gt;

&lt;p&gt;Because without visibility, optimization becomes guesswork. And guesswork is expensive.&lt;/p&gt;

&lt;p&gt;This guide takes a different approach. Five days. No third-party FinOps platforms. Just native tooling, deliberate structure, and a system engineers will actually use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 1: Tagging Strategy The Foundation Everything Else Depends On
&lt;/h2&gt;

&lt;p&gt;Every meaningful cost analysis begins with attribution. Without tags, cost data is a monolith. With tags, it becomes dimensional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Tagging Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A minimal, effective tagging schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"team"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"platform"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"auth-api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"environment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"team-lead"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enforcing Tags at Resource Creation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 run-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image-id&lt;/span&gt; ami-123456 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tag-specifications&lt;/span&gt; &lt;span class="s1"&gt;'ResourceType=instance,Tags=[{Key=team,Value=platform},{Key=service,Value=auth-api}]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tag Compliance Check&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws resourcegroupstaggingapi get-resources &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tag-filters&lt;/span&gt; &lt;span class="nv"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;team
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why This Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tags are not metadata. They are the index keys for your cost database.&lt;/p&gt;

&lt;p&gt;No tags → no attribution → no accountability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 2: AWS Cost Explorer API — Pulling Cost Data Programmatically
&lt;/h2&gt;

&lt;p&gt;The console is fine for humans. Systems need APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic Cost Query&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;ce&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ce&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ce&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_cost_and_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;TimePeriod&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;End&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;Granularity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DAILY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UnblendedCost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Group by Service and Team Tag&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ce&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_cost_and_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;TimePeriod&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;End&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;Granularity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DAILY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UnblendedCost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;GroupBy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DIMENSION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SERVICE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TAG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost data is delayed (~24h), but still actionable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This becomes your source of truth. Everything else builds on it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 3: Building Per-Service Cost Dashboards in Grafana
&lt;/h2&gt;

&lt;p&gt;Raw data is inert. Visualization activates it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS Cost Explorer → Export Script → JSON/Prometheus → Grafana
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example Export Script&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ResultsByTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Prometheus Metric Format&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;aws_cost&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"EC2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"platform"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mf"&gt;123.45&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Grafana Panel Query&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by(service) (aws_cost)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dashboard Views&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cost per service&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost per team&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Daily trend lines&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Top 10 spenders&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Good dashboards don’t overwhelm. They illuminate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 4: Anomaly Detection Alerting When Cost Spikes Unexpectedly
&lt;/h2&gt;

&lt;p&gt;Spikes happen. Some are valid. Others are not.&lt;/p&gt;

&lt;p&gt;Detection must be immediate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple Threshold Alert&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_cost_daily &amp;gt; 500
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deviation-Based Alert&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws_cost_daily &amp;gt; avg_over_time(aws_cost_daily[7d]) * 1.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CloudWatch Anomaly Detection&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudwatch put-anomaly-detector &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; EstimatedCharges &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/Billing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alert Routing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert → SNS → Slack / Email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Short spikes matter. Long drifts matter more.&lt;/p&gt;

&lt;p&gt;Both need visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 5: The Weekly Cost Digest Automated Slack Report Per Team
&lt;/h2&gt;

&lt;p&gt;Dashboards are passive. Digests are proactive.&lt;/p&gt;

&lt;p&gt;Engineers rarely check dashboards. They read Slack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weekly Cost Digest Script&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# cost_digest.py Weekly per-team cost report to Slack
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;slack_sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WebClient&lt;/span&gt;

&lt;span class="n"&gt;ce&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ce&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;slack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WebClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_SLACK_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_team_costs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;team_tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ce&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_cost_and_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;TimePeriod&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;End&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
        &lt;span class="n"&gt;Granularity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DAILY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;team_tag&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MatchOptions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EQUALS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
        &lt;span class="n"&gt;Metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UnblendedCost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;GroupBy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DIMENSION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SERVICE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;totals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ResultsByTime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Groups&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;svc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UnblendedCost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;totals&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;totals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;totals&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;post_digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;costs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_team_costs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;costs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*Weekly Cloud Cost Digest — Team: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;team&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total (last 7 days): *$&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;costs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; • &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;svc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;slack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat_postMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Run weekly via EventBridge scheduled rule
&lt;/span&gt;&lt;span class="nf"&gt;post_digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform-team&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#platform-costs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scheduling with EventBridge&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws events put-rule &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schedule-expression&lt;/span&gt; &lt;span class="s2"&gt;"cron(0 9 ? * MON *)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a ritual. A cadence. Cost becomes visible and social.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bonus: Cost-per-Request Metrics Using CloudWatch + Lambda&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Absolute cost is useful. Unit cost is transformative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom Metric Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;cloudwatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cloudwatch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_metric_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AppMetrics&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MetricData&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CostPerRequest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.002&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Formula&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost per request = Total service cost / Total requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now teams optimize efficiency not just spend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure and GCP Equivalents
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Azure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cost Management API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Azure Monitor&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tags via Resource Manager&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GCP&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Billing Export to BigQuery&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Looker Studio dashboards&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Labels for resource tagging&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The principles remain identical. Only the APIs differ.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Tagging Mistakes (and How to Fix Them)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Inconsistent Tag Keys&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;team vs Team vs TEAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fix: Enforce via policy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Missing Tags on Critical Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fix: Use SCPs or IAM policies&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Deny"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ec2:RunInstances"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Null"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"aws:RequestTag/team"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Over-Tagging&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Too many tags dilute clarity.&lt;/p&gt;

&lt;p&gt;Fix: Keep it minimal. Intentional.&lt;/p&gt;

&lt;p&gt;FinOps does not begin with optimization. It begins with visibility.&lt;/p&gt;

&lt;p&gt;In five days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Costs become attributable&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dashboards become actionable&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alerts become immediate&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Engineers become accountable&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And something subtle happens.&lt;/p&gt;

&lt;p&gt;Cost stops being a finance concern. It becomes an engineering signal.&lt;/p&gt;

&lt;p&gt;That shift quiet, structural, and profound is where real savings begin.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>devops</category>
      <category>finops</category>
      <category>aws</category>
    </item>
    <item>
      <title>Your First LLMOps Pipeline: From Prompt to Production in One Sprint</title>
      <dc:creator>varun varde</dc:creator>
      <pubDate>Tue, 21 Apr 2026 04:37:00 +0000</pubDate>
      <link>https://dev.to/varunvarde/your-first-llmops-pipeline-from-prompt-to-production-in-one-sprint-4ppp</link>
      <guid>https://dev.to/varunvarde/your-first-llmops-pipeline-from-prompt-to-production-in-one-sprint-4ppp</guid>
      <description>&lt;p&gt;AI applications don’t behave like traditional systems. They don’t fail cleanly. They don’t produce identical outputs for identical inputs. And they don’t lend themselves to binary testing pass or fail.&lt;/p&gt;

&lt;p&gt;Instead, they operate in gradients. Probabilities. Trade-offs.&lt;/p&gt;

&lt;p&gt;That is precisely why applying standard DevOps or MLOps practices without adaptation often leads to brittle pipelines and unreliable outcomes.&lt;/p&gt;

&lt;p&gt;This guide walks through a complete LLMOps pipeline practical, production-ready, and deployable within a single sprint.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLMOps vs MLOps vs DevOps - The Operational Model Differences
&lt;/h2&gt;

&lt;p&gt;Traditional DevOps assumes determinism&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → Code → Output (predictable)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MLOps introduces probabilistic behavior but still focuses on trained models&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → Model → Prediction (statistical)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLMOps shifts the paradigm further&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → Prompt + Model → Generated Output (non-deterministic)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key distinctions&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Outputs vary even with identical inputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prompt design is as critical as code&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Latency and cost are tied to tokens, not just compute&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This necessitates new operational primitives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Versioning: Treating Prompts as Code
&lt;/h2&gt;

&lt;p&gt;Prompts are no longer ephemeral strings. They are artifacts.&lt;/p&gt;

&lt;p&gt;Store them in Git&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/prompts/
  summarization/
    v1.0.0.txt
    v1.1.0.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example prompt&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# v2.3.1
Summarize the following text in 3 bullet points with a professional tone:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reference prompts explicitly in code&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PROMPT_VERSION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v2.3.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompts/summarization/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROMPT_VERSION&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Never use latest. Ambiguity is the enemy of reproducibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Frameworks: How to Test LLM Outputs
&lt;/h2&gt;

&lt;p&gt;Testing LLMs requires nuance. Exact matches are rare. Evaluation must be semantic.&lt;/p&gt;

&lt;p&gt;Example using a scoring function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;similarity_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dataset-driven testing&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Explain Kubernetes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Container orchestration platform"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run batch evaluations&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;python evaluate.py --dataset test_cases.json
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Metrics to track&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Relevance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Coherence&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hallucination rate&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Testing becomes statistical—not absolute.&lt;/p&gt;

&lt;h2&gt;
  
  
  CI/CD for LLM Applications: What to Run on Every PR
&lt;/h2&gt;

&lt;p&gt;CI pipelines must evolve.&lt;/p&gt;

&lt;p&gt;A minimal LLM CI pipeline&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM CI&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python evaluate.py&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python lint_prompts.py&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python cost_estimator.py&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checks include&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Prompt syntax validation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Regression detection in outputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost estimation per request&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A failing evaluation blocks the merge. Quality is enforced early.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Patterns: Blue-Green and Canary
&lt;/h2&gt;

&lt;p&gt;Non-determinism demands cautious rollout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blue-Green Deployment&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: v1 (blue)
version: v2 (green)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Switch traffic atomically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Canary Deployment&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;traffic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;v1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;90%&lt;/span&gt;
  &lt;span class="na"&gt;v2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monitor performance before full rollout.&lt;/p&gt;

&lt;p&gt;Example Kubernetes snippet&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-service-v2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Observe behavior before committing fully.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability: Traces, Latency, and Token Costs
&lt;/h2&gt;

&lt;p&gt;Observability must capture more than uptime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tracing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nb"&gt;histogram_quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_latency_seconds_bucket&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5m&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost Tracking&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;increase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_tokens_total&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1h&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.000002&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dashboards should answer&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;How fast?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How expensive?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How reliable?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Guardrails: Output Validation and Fallback Chains
&lt;/h2&gt;

&lt;p&gt;LLMs can produce unexpected outputs. Guardrails mitigate risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forbidden_word&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fallback Chain&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_primary_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_secondary_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Content Filtering&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;toxicity_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content not allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Guardrails are not optional. They are essential.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Controls: Token Budgets and Rate Limiting
&lt;/h2&gt;

&lt;p&gt;Costs scale with usage. Left unchecked, they escalate rapidly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token Limits&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MAX_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rate Limiting&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;requests_per_minute&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;reject_request&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Budget Enforcement&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;monthly_tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;disable_non_critical_features&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost awareness must be embedded in the system—not retrofitted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Human-in-the-Loop Workflows
&lt;/h2&gt;

&lt;p&gt;For high-stakes decisions, automation alone is insufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approval Workflow&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM Output → Human Review → Final Decision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Queue System&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;confidence_score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;send_to_review_queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Humans provide judgment where models provide probability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Complete Example: Production-Ready LLM Pipeline on Kubernetes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# llm-pipeline-values.yaml — Kubernetes deployment with cost + observability&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-service&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your-org/llm-service:v1.2.0&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MAX_TOKENS_PER_REQUEST&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2000"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MONTHLY_TOKEN_BUDGET&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10000000"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://otel-collector:4317"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PROMPT_VERSION&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v2.3.1"&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;256Mi"&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100m"&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring.coreos.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PrometheusRule&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-cost-alerts&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm_cost&lt;/span&gt;
      &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLMDailySpendHigh&lt;/span&gt;
          &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(increase(llm_tokens_total[24h])) * 0.000002 &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;
          &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;daily&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;spend&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exceeding&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$50&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;threshold"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration encapsulates&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Versioned prompts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Observability hooks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost safeguards&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalable deployment&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLMOps is not an extension of DevOps. It is a rethinking.&lt;/p&gt;

&lt;p&gt;Systems are no longer deterministic. Testing is no longer binary. Costs are no longer predictable.&lt;/p&gt;

&lt;p&gt;Yet, with the right structure versioning, evaluation, observability, and control—the uncertainty becomes manageable. Even advantageous.&lt;/p&gt;

&lt;p&gt;A well-designed LLMOps pipeline does not eliminate unpredictability. It harnesses it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>llmops</category>
    </item>
    <item>
      <title>Building Production-Grade Observability: OpenTelemetry + Grafana Stack</title>
      <dc:creator>varun varde</dc:creator>
      <pubDate>Tue, 14 Apr 2026 09:05:05 +0000</pubDate>
      <link>https://dev.to/varunvarde/building-production-grade-observability-opentelemetry-grafana-stack-9mc</link>
      <guid>https://dev.to/varunvarde/building-production-grade-observability-opentelemetry-grafana-stack-9mc</guid>
      <description>&lt;p&gt;Stop guessing what's broken in production. Here's a complete, deploy-it-this-week observability stack built on OpenTelemetry and Grafana — the same stack I've deployed for three clients in the last 18 months.&lt;/p&gt;

&lt;p&gt;This isn't a toy setup. This is production-grade: traces, metrics, and logs unified under a single pane of glass, with auto-instrumentation for the most common runtimes, alerting that pages on symptoms not causes, and dashboards your non-SRE teammates can actually read.&lt;/p&gt;

&lt;p&gt;What you'll build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;OpenTelemetry Collector (gateway mode) for vendor-agnostic telemetry collection&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grafana Tempo for distributed tracing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prometheus + Grafana Mimir for metrics at scale&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loki for structured log aggregation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grafana dashboards with pre-built SLO panels&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AlertManager rules tied to error budgets&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prerequisites: Kubernetes 1.25+, Helm 3, basic familiarity with YAML. Estimated time: 3–5 hours end to end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenTelemetry? The vendor-lock argument settled once and for all
&lt;/h2&gt;

&lt;p&gt;You’ve heard it before: “Just use Datadog.” Then the bill arrives. Or “Use Prometheus alone.” Then you lose traces.&lt;/p&gt;

&lt;p&gt;OpenTelemetry (OTel) is the single CNCF standard for generating and exporting telemetry data. Here’s why it wins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;One instrumentation, many backends: Instrument your app once with OTel SDKs. Send to Tempo, Jaeger, Datadog, or New Relic simultaneously.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No vendor lock-in: Your telemetry data remains in your control (S3 for traces, block storage for metrics).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatic context propagation: Trace IDs flow seamlessly across services, even across different languages (Java → Python → Node.js).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Future-proof: New backends emerge? Point your OTel Collector there. No code changes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bottom line: OTel is the USB-C of observability. Stop writing custom exporters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture overview: Collector, Backends, Visualization
&lt;/h2&gt;

&lt;p&gt;Here’s what you’re deploying:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Your App] --(OTLP)--&amp;gt; [OTel Collector (Gateway)] --+--&amp;gt; [Tempo] (traces)
                                                      +--&amp;gt; [Mimir] (metrics)
                                                      +--&amp;gt; [Loki] (logs)
                                                              |
                                                         [Grafana] (visualization)
                                                              |
                                                       [AlertManager] (paging)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;OTel Collector (Gateway mode): Receives OTLP from all services. Validates, batches, and routes telemetry. Single ingress point.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tempo: Object-storage-backed tracing. Cheap, scalable, no indexing costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mimir: Horizontally scalable Prometheus-compatible metrics store.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loki: Log aggregation with low-cost object storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grafana: Unified UI with Explore, dashboards, and alerting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AlertManager: Deduplicates, groups, and routes alerts to PagerDuty/Slack.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Storage requirements (minimal): 50GB for Loki, 100GB for Tempo (can use S3/GCS/MinIO), 50GB for Mimir.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing the OTel Collector (gateway mode Helm values)
&lt;/h2&gt;

&lt;p&gt;Create otel-collector-values.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployment&lt;/span&gt;   &lt;span class="c1"&gt;# gateway mode (as opposed to daemonset for agent mode)&lt;/span&gt;

&lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4317&lt;/span&gt;
        &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;

  &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1s&lt;/span&gt;
      &lt;span class="na"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt;
    &lt;span class="na"&gt;memory_limiter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;check_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1s&lt;/span&gt;
      &lt;span class="na"&gt;limit_mib&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;512&lt;/span&gt;
    &lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;environment&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
          &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;upsert&lt;/span&gt;

  &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;otlp/tempo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tempo-distributor:4317"&lt;/span&gt;
      &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;prometheusremotewrite/mimir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://mimir-distributor:8080/api/v1/push"&lt;/span&gt;
    &lt;span class="na"&gt;loki&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://loki-gateway:3100/loki/api/v1/push"&lt;/span&gt;

  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;memory_limiter&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attributes&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/tempo&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;memory_limiter&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;prometheusremotewrite/mimir&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;memory_limiter&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;loki&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; otel-collector open-telemetry/opentelemetry-collector &lt;span class="nt"&gt;-f&lt;/span&gt; otel-collector-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Auto-instrumentation: Java, Python, Node.js, Go
&lt;/h2&gt;

&lt;p&gt;No code changes for traces/metrics/logs. Use OTel's auto-instrumentation agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Java (Spring Boot, any JVM app)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; JAVA_TOOL_OPTIONS="-javaagent:/otel/opentelemetry-javaagent.jar"&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; OTEL_SERVICE_NAME=payment-service&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python (Django, Flask, FastAPI)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nb"&gt;install
&lt;/span&gt;otel-instrument &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service_name&lt;/span&gt; checkout-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--exporter_otlp_endpoint&lt;/span&gt; http://otel-collector:4317 &lt;span class="se"&gt;\&lt;/span&gt;
  python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Node.js (Express, NestJS)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @opentelemetry/auto-instrumentations-node
npx opentelemetry-instrument &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;api-gateway &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--exporter_otlp_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://otel-collector:4317 &lt;span class="se"&gt;\&lt;/span&gt;
  node server.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Go (manual instrumentation required, but minimal)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"go.opentelemetry.io/otel"&lt;/span&gt;
    &lt;span class="s"&gt;"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;initTracer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;otlptracegrpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;otlptracegrpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"otel-collector:4317"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;otlptracegrpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithInsecure&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="c"&gt;// ... standard setup (5 lines)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify: Check Collector logs for TraceID spans.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying Tempo for distributed tracing
&lt;/h2&gt;

&lt;p&gt;Tempo is designed for cost-effective tracing. It stores traces in object storage (S3/MinIO) and indexes only by trace ID.&lt;/p&gt;

&lt;p&gt;tempo-values.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tempo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;trace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3&lt;/span&gt;
      &lt;span class="na"&gt;s3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tempo-traces&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;minio.minio:9000&lt;/span&gt;
        &lt;span class="na"&gt;access_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minioadmin"&lt;/span&gt;
        &lt;span class="na"&gt;secret_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minioadmin"&lt;/span&gt;
        &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;pool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;max_workers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
        &lt;span class="na"&gt;queue_depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;

  &lt;span class="na"&gt;overrides&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;defaults&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ingestion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;rate_limit_bytes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15000000&lt;/span&gt;   &lt;span class="c1"&gt;# 15MB/s&lt;/span&gt;
        &lt;span class="na"&gt;burst_size_bytes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20000000&lt;/span&gt;

&lt;span class="na"&gt;distributor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0:4317"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; tempo grafana/tempo &lt;span class="nt"&gt;-f&lt;/span&gt; tempo-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query Tempo from Grafana: Add data source → Tempo → URL: &lt;a href="http://tempo-query-frontend:16686" rel="noopener noreferrer"&gt;http://tempo-query-frontend:16686&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prometheus + Mimir for long-term metrics storage
&lt;/h2&gt;

&lt;p&gt;Mimir replaces single-instance Prometheus. It provides horizontal scaling, replication, and long-term retention.&lt;/p&gt;

&lt;p&gt;mimir-values.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;mimir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;structuredConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;blocks_storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3&lt;/span&gt;
      &lt;span class="na"&gt;s3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;minio.minio:9000&lt;/span&gt;
        &lt;span class="na"&gt;bucket_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mimir-blocks&lt;/span&gt;
        &lt;span class="na"&gt;access_key_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minioadmin"&lt;/span&gt;
        &lt;span class="na"&gt;secret_access_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minioadmin"&lt;/span&gt;
        &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;ingester&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;replication_factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;   &lt;span class="c1"&gt;# for HA&lt;/span&gt;
    &lt;span class="na"&gt;ruler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;rule_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/data/rules&lt;/span&gt;
      &lt;span class="na"&gt;alertmanager_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://alertmanager:9093&lt;/span&gt;

  &lt;span class="na"&gt;ingester&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;distributor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;querier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; mimir grafana/mimir &lt;span class="nt"&gt;-f&lt;/span&gt; mimir-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Migrate existing Prometheus data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;promtool tsdb create-blocks-from-rules &lt;span class="nt"&gt;--rules-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;recording-rules.yaml data/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then point Prometheus remote write to &lt;a href="http://mimir-distributor:8080/api/v1/push" rel="noopener noreferrer"&gt;http://mimir-distributor:8080/api/v1/push&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loki for log aggregation with structured querying
&lt;/h2&gt;

&lt;p&gt;Loki is like Prometheus for logs. It indexes only labels, not full text, making it cheap at scale.&lt;/p&gt;

&lt;p&gt;loki-values.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;loki&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3&lt;/span&gt;
    &lt;span class="na"&gt;s3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;minio.minio:9000&lt;/span&gt;
      &lt;span class="na"&gt;bucketnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki-chunks&lt;/span&gt;
      &lt;span class="na"&gt;access_key_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minioadmin"&lt;/span&gt;
      &lt;span class="na"&gt;secret_access_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minioadmin"&lt;/span&gt;
      &lt;span class="na"&gt;s3forcepathstyle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="na"&gt;schemaConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2024-01-01&lt;/span&gt;
        &lt;span class="na"&gt;store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;boltdb-shipper&lt;/span&gt;
        &lt;span class="na"&gt;object_store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3&lt;/span&gt;
        &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v12&lt;/span&gt;
        &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki_index_&lt;/span&gt;
          &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;24h&lt;/span&gt;

  &lt;span class="na"&gt;limits_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ingestion_rate_mb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;ingestion_burst_size_mb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
    &lt;span class="na"&gt;max_global_streams_per_user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;

  &lt;span class="na"&gt;chunk_store_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;max_look_back_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;672h&lt;/span&gt;  &lt;span class="c1"&gt;# 28 days&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; loki grafana/loki &lt;span class="nt"&gt;-f&lt;/span&gt; loki-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Query example (LogQL)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{namespace="production", app="payment-service"} |= "error" 
| json 
| latency_ms &amp;gt; 500 
| line_format "{{.trace_id}} - {{.message}}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Grafana: Connecting all three data sources
&lt;/h2&gt;

&lt;p&gt;grafana-values.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;datasources.yaml&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prometheus-Mimir&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
        &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://mimir-query-frontend:8080/prometheus&lt;/span&gt;
        &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
        &lt;span class="na"&gt;isDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Tempo&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tempo&lt;/span&gt;
        &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://tempo-query-frontend:16686&lt;/span&gt;
        &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
        &lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;tracesToLogs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;datasourceUid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loki'&lt;/span&gt;
            &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;service.name'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pod'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
          &lt;span class="na"&gt;serviceMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Loki&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki&lt;/span&gt;
        &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://loki-gateway:3100&lt;/span&gt;
        &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
        &lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;derivedFields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trace_id&lt;/span&gt;
              &lt;span class="na"&gt;matcherRegex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trace_id=(\w+)'&lt;/span&gt;
              &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;$${__value.raw}'&lt;/span&gt;
              &lt;span class="na"&gt;datasourceUid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tempo'&lt;/span&gt;

&lt;span class="na"&gt;dashboardProviders&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dashboardproviders.yaml&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;slo'&lt;/span&gt;
        &lt;span class="na"&gt;orgId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
        &lt;span class="na"&gt;folder&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SLO&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Dashboards'&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/lib/grafana/dashboards&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; grafana grafana/grafana &lt;span class="nt"&gt;-f&lt;/span&gt; grafana-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test correlation: In Loki, find a log with trace_id=abc123. Click it → jumps to Tempo trace. In Tempo, see affected service → jumps to Mimir metrics for that service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building your first SLO dashboard (template included)
&lt;/h2&gt;

&lt;p&gt;Save as slo-dashboard.json and mount into Grafana&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SLO Dashboard - Payment Service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"panels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Availability (30d SLI)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"expr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sum(rate(http_requests_total{status!~'5..'}[$__range])) / sum(rate(http_requests_total[$__range]))"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"legendFormat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Availability SLI"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"thresholds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"red"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"valueType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"absolute"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.995&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"yellow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"valueType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"absolute"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.999&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"green"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gte"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"valueType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"absolute"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.999&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Error Budget Remaining (30d)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"expr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"(1 - (sum(rate(http_requests_total{status=~'5..'}[30d])) / sum(rate(http_requests_total[30d])))) / 0.999"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"legendFormat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Budget remaining"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"fieldConfig"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"defaults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"unit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"percentunit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"min"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"thresholds"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"thresholds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"red"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"yellow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"green"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gte"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Latency P99 (30d SLI)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"expr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__range])) by (le))"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"legendFormat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"P99 latency"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SLO math explained&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Availability target: 99.9% → error budget = 0.1% of requests can fail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Budget remaining: (actual_availability - target) / (1 - target) → 1.0 means on track, 0 means exhausted.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AlertManager: Alerting on symptoms, not causes
&lt;/h2&gt;

&lt;p&gt;Bad alert: &lt;em&gt;"CPU on pod payment-7d8f9 is 92%"&lt;/em&gt; (cause)&lt;br&gt;
Good alert: "Payment service error budget exhausted" (symptom)&lt;/p&gt;

&lt;p&gt;alertmanager-config.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;group_by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alertname'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;service'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;group_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
  &lt;span class="na"&gt;group_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
  &lt;span class="na"&gt;repeat_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;12h&lt;/span&gt;
  &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pagerduty-critical'&lt;/span&gt;
  &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pagerduty-critical&lt;/span&gt;
      &lt;span class="na"&gt;continue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slack-warnings&lt;/span&gt;

&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pagerduty-critical'&lt;/span&gt;
    &lt;span class="na"&gt;pagerduty_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;service_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;your-pd-key&amp;gt;&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;slack-warnings'&lt;/span&gt;
    &lt;span class="na"&gt;slack_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;api_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;webhook&amp;gt;&lt;/span&gt;
        &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#alerts-warning'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Prometheus alerting rule example&lt;/strong&gt; (slo-alerts.yaml)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slo&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ErrorBudgetExhausted&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;(1 - (sum(rate(http_requests_total{status=~"5.."}[30d])) &lt;/span&gt;
          &lt;span class="s"&gt;/ sum(rate(http_requests_total[30d])))) / 0.999 &amp;lt; 0.2&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{$labels.service}}"&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;budget&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{$labels.service}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;below&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;20%"&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remaining&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;budget:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;humanizePercentage}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create configmap alertmanager-config &lt;span class="nt"&gt;--from-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;alertmanager.yaml&lt;span class="o"&gt;=&lt;/span&gt;alertmanager-config.yaml
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; prometheus prometheus-community/prometheus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; alertmanager.enabled&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; alertmanager.configFromSecret&lt;span class="o"&gt;=&lt;/span&gt;alertmanager-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The 3 dashboards every on-call engineer needs
&lt;/h2&gt;

&lt;p&gt;Stop building 50-panel dashboards. Start with these three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dashboard 1: Service Health (RED method)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Rate (requests per second) per endpoint&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Errors (5xx rate, grouped by status code)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Duration (P50, P95, P99 latency)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Saturation (CPU/memory per pod, queue depth)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PromQL snippets&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rate
sum(rate(http_requests_total[1m])) by (service, endpoint)

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))

# P99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dashboard 2: Trace Explorer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Top 10 slowest traces in last hour&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Trace heatmap (duration vs. timestamp)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Service dependency graph (from Tempo service graph)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High-error traces panel (filter by status.error=true)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dashboard 3: The "Burndown" Chart&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Error budget remaining (daily trend line)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SLO burn rate (1h, 6h, 24h windows)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-burn alert status (green/yellow/red)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Top offending services by error budget consumption&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why this works: On-call opens Dashboard 1 → sees elevated latency → clicks a trace in Dashboard 2 → finds slow database query → checks Dashboard 3 to decide if paging SREs is urgent.&lt;/p&gt;

&lt;p&gt;Final checklist for production readiness&lt;/p&gt;

&lt;p&gt;Before you sleep soundly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ingestion testing: curl a test span/metric/log through the Collector.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retention: Set Mimir 30d, Tempo 14d, Loki 30d (adjust to compliance).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Auth: Add Grafana OAuth (Google/GitHub) and basic auth for Mimir/Loki ingesters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Backups: Object storage (MinIO/S3) should have versioning enabled.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alert testing: Silence a service, verify PagerDuty gets the page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Runbook: Link each alert to a Confluence doc (e.g., "ErrorBudgetExhausted → &lt;a href="https://wiki/runbooks/slo%22" rel="noopener noreferrer"&gt;https://wiki/runbooks/slo"&lt;/a&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What’s next? Add OpenTelemetry for your database (PostgreSQL, Redis, MongoDB) using OTel collector receivers. Or add synthetic monitoring with Blackbox exporter.&lt;/p&gt;

&lt;p&gt;You now have the same stack that cost my clients $0/month (excluding storage) instead of $15k/month for Datadog. Ship it.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>sre</category>
      <category>grafana</category>
    </item>
    <item>
      <title>From GitHub Actions to ArgoCD: A Complete GitOps Migration Guide</title>
      <dc:creator>varun varde</dc:creator>
      <pubDate>Tue, 07 Apr 2026 13:33:40 +0000</pubDate>
      <link>https://dev.to/varunvarde/from-github-actions-to-argocd-a-complete-gitops-migration-guide-4kd4</link>
      <guid>https://dev.to/varunvarde/from-github-actions-to-argocd-a-complete-gitops-migration-guide-4kd4</guid>
      <description>&lt;p&gt;I've done this migration three times now. Each time, I learn something new about what NOT to do. This guide is the documentation I wish I'd had the first time: the complete migration from a multi-stage GitHub Actions deployment pipeline to a fully declarative GitOps workflow with ArgoCD including the parts every other tutorial glosses over.&lt;/p&gt;

&lt;p&gt;What you'll build: Production ready ArgoCD installation, ApplicationSet for multi cluster management, Image Updater for automated rollouts, RBAC configuration for multi-team environments, Rollback procedures that work under pressure.&lt;/p&gt;

&lt;p&gt;Prerequisites: Kubernetes cluster (1.24+), kubectl, Helm 3, existing GitHub Actions CI pipeline. Estimated time: 4-6 hours for learning setup (30 days for production migration).&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Assessment — Mapping Your Existing GitHub Actions Pipeline
&lt;/h2&gt;

&lt;p&gt;Every successful migration begins with clarity. Before introducing GitOps, the existing CI/CD topology must be dissected—methodically, not casually.&lt;/p&gt;

&lt;p&gt;Start by inventorying workflows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/deploy.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy Service&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t myapp .&lt;/span&gt;

  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubectl apply -f k8s/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pipeline reveals implicit assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Deployment is tightly coupled with CI&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kubernetes manifests are applied imperatively&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No state reconciliation exists&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Map each component&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Build Step&lt;/td&gt;
&lt;td&gt;Artifact creation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy Step&lt;/td&gt;
&lt;td&gt;Cluster mutation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets&lt;/td&gt;
&lt;td&gt;Runtime configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The goal is not replication but transformation. GitOps demands declarative intent, not procedural execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: ArgoCD Installation (Production-Grade Helm Values)
&lt;/h2&gt;

&lt;p&gt;A default installation is rarely sufficient. Production demands rigor secure ingress, RBAC, and resource constraints.&lt;/p&gt;

&lt;p&gt;Install ArgoCD using Helm&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add argo https://argoproj.github.io/argo-helm
helm &lt;span class="nb"&gt;install &lt;/span&gt;argocd argo/argo-cd &lt;span class="nt"&gt;-n&lt;/span&gt; argocd &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A hardened values.yaml might resemble&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;

  &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;argocd.example.com&lt;/span&gt;

&lt;span class="na"&gt;configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server.insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

  &lt;span class="na"&gt;cm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://argocd.example.com&lt;/span&gt;

&lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Security is not ornamental. It is foundational. Misconfigured access at this stage invites future instability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 3: Structuring Your GitOps Repository
&lt;/h2&gt;

&lt;p&gt;Git becomes the single source of truth. Structure, therefore, becomes paramount.&lt;/p&gt;

&lt;p&gt;A canonical layout&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gitops-repo/
├── apps/
│   ├── payments/
│   ├── auth/
├── environments/
│   ├── dev/
│   ├── staging/
│   └── prod/
└── infrastructure/
    ├── ingress/
    └── monitoring/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each directory encapsulates intent.&lt;/p&gt;

&lt;p&gt;Example application manifest&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/org/gitops-repo&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/payments&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Declarative. Predictable. Auditable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 4: ApplicationSet Configuration for Multi-Team
&lt;/h2&gt;

&lt;p&gt;At scale, manually defining applications becomes untenable. ApplicationSet introduces dynamic orchestration.&lt;/p&gt;

&lt;p&gt;A Git generator example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ApplicationSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-apps&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;generators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;git&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/org/gitops-repo&lt;/span&gt;
        &lt;span class="na"&gt;revision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
        &lt;span class="na"&gt;directories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/*&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{path.basename}}'&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/org/gitops-repo&lt;/span&gt;
        &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{path}}'&lt;/span&gt;
      &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{path.basename}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Teams simply add a folder. The system reacts autonomously.&lt;/p&gt;

&lt;p&gt;This is where GitOps begins to feel almost sentient responsive, adaptive, yet deterministic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 5: Migrating Your First Service (with Rollback Plan)
&lt;/h2&gt;

&lt;p&gt;Start small. Select a non-critical service. Minimize risk while maximizing insight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Remove Deploy Step from CI&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Keep build, remove kubectl apply&lt;/span&gt;
- run: docker build &lt;span class="nt"&gt;-t&lt;/span&gt; myapp &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Push Kubernetes Manifests to GitOps Repo&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Let ArgoCD Sync&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ArgoCD reconciles desired state with actual state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rollback Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rollback becomes trivial&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git revert &amp;lt;commit-hash&amp;gt;
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No manual intervention. No frantic patching. Just version control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 6: Image Updater for CI Integration
&lt;/h2&gt;

&lt;p&gt;CI still builds images. But deployment is decoupled.&lt;/p&gt;

&lt;p&gt;ArgoCD Image Updater bridges the gap&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd-image-updater.argoproj.io/image-list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp=repo/myapp&lt;/span&gt;
    &lt;span class="na"&gt;argocd-image-updater.argoproj.io/update-strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pipeline pushes image&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker push repo/myapp:1.2.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Image Updater modifies Git&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;repo/myapp:1.2.3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ArgoCD syncs automatically.&lt;/p&gt;

&lt;p&gt;Elegant. Autonomous.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 7: RBAC and Multi-Tenant Configuration
&lt;/h2&gt;

&lt;p&gt;As teams proliferate, governance becomes essential.&lt;/p&gt;

&lt;p&gt;Define roles&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd-rbac-cm&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;policy.csv&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;p, role:dev, applications, get, */*, allow&lt;/span&gt;
    &lt;span class="s"&gt;p, role:dev, applications, sync, */*, allow&lt;/span&gt;
    &lt;span class="s"&gt;g, team-dev, role:dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Namespace isolation&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-a&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each team operates within boundaries autonomous yet controlled.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 8: Observability for Your GitOps Pipelines
&lt;/h2&gt;

&lt;p&gt;Visibility transforms guesswork into certainty.&lt;/p&gt;

&lt;p&gt;Enable metrics&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scrape with Prometheus&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;argocd'&lt;/span&gt;
  &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;argocd-metrics:8082'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key metrics&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Sync status&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reconciliation duration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Drift detection&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Logs provide narrative&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs deployment/argocd-application-controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without observability, GitOps becomes a black box. With it, a glass box.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Things No Tutorial Mentions (Learned the Hard Way)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Git Becomes Your Bottleneck&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Frequent commits can overwhelm repositories. Optimize branching strategies. Avoid unnecessary churn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Drift Happens More Than Expected&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Manual changes in clusters still occur. ArgoCD corrects them—but not always instantly. Expect transient inconsistencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Secrets Are a Persistent Challenge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Plain YAML is insufficient. Integrate tools like sealed secrets or external secret managers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. ApplicationSet Can Spiral into Complexity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dynamic generation is powerful—but dangerous. Poor templates create cascading misconfigurations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Human Behavior Is the Hardest Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Engineers accustomed to imperative deployments resist change. Training is not optional. It is essential.&lt;/p&gt;

&lt;p&gt;Migrating to ArgoCD is not merely a tooling shift. It is a philosophical transition—from imperative control to declarative intent.&lt;/p&gt;

&lt;p&gt;The rewards are substantial. Stability. Traceability. Confidence.&lt;/p&gt;

&lt;p&gt;But the journey demands discipline. And a willingness to relinquish old habits in favor of something far more resilient.&lt;/p&gt;

</description>
      <category>gitops</category>
      <category>argocd</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Zero Downtime Cloud Migration: The 6-Phase Playbook</title>
      <dc:creator>varun varde</dc:creator>
      <pubDate>Thu, 26 Mar 2026 08:27:40 +0000</pubDate>
      <link>https://dev.to/varunvarde/zero-downtime-cloud-migration-the-6-phase-playbook-13p8</link>
      <guid>https://dev.to/varunvarde/zero-downtime-cloud-migration-the-6-phase-playbook-13p8</guid>
      <description>&lt;p&gt;Phase 1 is never 'lift and shift.' Here's the framework that keeps production stable throughout the entire move.&lt;/p&gt;

&lt;p&gt;After leading migrations for organizations ranging from 200 to 4,000 engineers, I've distilled it into 6 phases that keep production alive and teams sane.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Discovery &amp;amp; Dependency Mapping&lt;/strong&gt;&lt;br&gt;
Before touching a single VM, audit everything.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Inventory all services: apps, databases, middleware, integrations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Map inter service dependencies including the undocumented ones (ask the engineers who've been there longest)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tag everything by criticality, data sensitivity, migration complexity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identify the "spiders in the web" services everything else depends on&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tools: AWS Application Discovery Service, Cloudamize, or a structured spreadsheet + interviews.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2: Cloud Foundation &amp;amp; Landing Zone&lt;/strong&gt;&lt;br&gt;
Build the platform before you migrate anything.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Set up your VPC architecture (hub spoke or flat decide now)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Implement IAM roles, SCPs, and guardrails&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deploy centralized logging, monitoring, and alerting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Establish your IaC standard (Terraform modules, Pulumi stacks — standardize before day one)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set cost budgets and alerts BEFORE anything runs&lt;br&gt;
Never skip this. Teams that migrate first and "sort out governance later" always regret it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: Pilot Migration (Noncritical workloads)&lt;/strong&gt;&lt;br&gt;
Start small. Build confidence.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Choose a non-critical, low traffic service&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run full migration lifecycle: move → validate → monitor → optimize&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Document everything. This becomes your runbook for every subsequent migration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identify gaps in tooling and process while the stakes are low&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Phase 4: Wave Based Migration&lt;/strong&gt;&lt;br&gt;
Organize workloads into migration waves by complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Wave 1: Stateless apps (easiest)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wave 2: Stateful apps with managed DB alternatives&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wave 3: Complex integrations and legacy services&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wave 4 (optional): Services that require refactoring before migration&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run each wave over 2–4-week sprints. Include a 2-week stabilization window after each wave before proceeding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 5: Cutover &amp;amp; Traffic Management&lt;/strong&gt;&lt;br&gt;
Zero-downtime means dual running, not big-bang.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Use feature flags and DNS-based traffic shifting (Route 53 weighted routing or equivalent)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Implement a circuit breaker to instantly roll back traffic if error rates spike&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep on prem running in parallel for 30–60 days post cutover&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Don't decommission anything until you've run at least one full billing cycle on cloud&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Phase 6: Optimize &amp;amp; Decommission&lt;/strong&gt;&lt;br&gt;
The migration isn't over when the apps are running.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Right-size instances based on real usage data (wait at least 2 weeks)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Implement autoscaling where you were running fixed capacity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set up FinOps dashboards cloud spend will be surprising at first&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Decommission on prem systematically, not in a rush&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run a migration retrospective with every team involved&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Rule I Never Break&lt;/strong&gt;&lt;br&gt;
No service goes to production without observability in place (logs, metrics, traces), a documented runbook, an on-call rotation assigned, and a tested rollback procedure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Save for your next migration kickoff&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>cloudskills</category>
      <category>aws</category>
      <category>webpack</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Terraform Modules That Actually Scale: Patterns from 20 Years</title>
      <dc:creator>varun varde</dc:creator>
      <pubDate>Wed, 18 Mar 2026 08:47:03 +0000</pubDate>
      <link>https://dev.to/varunvarde/terraform-modules-that-actually-scale-patterns-from-20-years-35j7</link>
      <guid>https://dev.to/varunvarde/terraform-modules-that-actually-scale-patterns-from-20-years-35j7</guid>
      <description>&lt;p&gt;Provisioning was once an artisanal craft defined by shell scripts and fragile golden images. We've since pivoted to the declarative, idempotent paradigm of Terraform. But scaling isn't just about resource count it’s about managing the cognitive load of massive environments. At some point, the question stops being how do I write HCL? and becomes how do I stop this complexity from collapsing under its own weight?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Monolith Trap: Why Large State Files Kill Velocity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Mega State is the ultimate architectural debt. Dumping your VPC, EKS cluster, and RDS instances into one root module creates a catastrophic blast radius. One syntax error in a security group shouldn't lock the state file for your entire production database. Scaling demands decoupling. Infrastructure components must be treated as independent lifecycles. Networking is slow and steady app pods are ephemeral. Isolate them to minimize plan times and risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strict Typing and Input Orthogonality&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ditch type = any. It’s lazy and dangerous. High utility modules require strict object validation to fail fast. If the calling code sends a malformed schema, the plan should die immediately not thirty minutes into a deployment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"cluster_config"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"EKS spec"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;version&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;
    &lt;span class="nx"&gt;vpc_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;
    &lt;span class="nx"&gt;subnet_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;node_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="nx"&gt;instance_types&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="nx"&gt;capacity_type&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="nx"&gt;validation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;condition&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s2"&gt;"1.28"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"1.29"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"1.30"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;error_message&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"K8s version restricted by organizational compliance."&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Compositional Strategy: Favoring Layering over Nesting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nesting modules inside modules leads to Prop Drilling a nightmare where a variable is passed through five layers of code just to reach a single resource. It's brittle.&lt;/p&gt;

&lt;p&gt;Layering is the professional choice. Keep modules flat and use a Composition Root to stitch outputs to inputs. This keeps the logic readable and the dependencies explicit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross Account Orchestration via Provider Aliasing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enterprise scale means multiple regions and hundreds of AWS accounts. Never hardcode provider blocks inside a module; it makes them non portable. Use Provider Aliases. This allows you to instantiate the same module across different regions within a single run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Instance in US-East&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"s3_east"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/s3"&lt;/span&gt;
  &lt;span class="nx"&gt;providers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;aws&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;us-east-1&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Instance in US-West&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"s3_west"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/s3"&lt;/span&gt;
  &lt;span class="nx"&gt;providers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;aws&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;us-west-2&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Immutable Versioning and Private Registries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Local file paths are for hobbyists. In production, referencing a local path means every save on your laptop could theoretically break a CI/CD pipeline. Scaling requires immutable versioning. Use Git tags or a Private Registry. It’s the only way to ensure Production stays on v1.2.0 while Dev experiments with v2.0.0 beta.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated Validation: Terratest and OPA Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you aren't testing, you're guessing. For core modules, Terratest is the standard. It spins up real resources, pings them to ensure they work, and tears them down. Couple this with Open Policy Agent (OPA) to programmatically block anyone from creating an unencrypted volume or a public S3 bucket before the code even leaves the PR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 2 Operations: Refactoring with moved Blocks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Refactoring used to mean terraform state mv commands that risked corrupting the remote backend. Now, we have the moved block. It allows you to move resources into modules without triggering a "destroy and recreate" cycle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;moved&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;from&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;legacy_app&lt;/span&gt;
  &lt;span class="nx"&gt;to&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;app_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how you turn a tangled mess of HCL into a robust ecosystem. It’s about predictability, isolation, and strict interfaces.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>devops</category>
      <category>webdev</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
