<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sunny Nazar</title>
    <description>The latest articles on DEV Community by Sunny Nazar (@sunnynazar).</description>
    <link>https://dev.to/sunnynazar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F823209%2Fb570381f-e6dd-45a7-b723-4e5e8709023e.jpeg</url>
      <title>DEV Community: Sunny Nazar</title>
      <link>https://dev.to/sunnynazar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sunnynazar"/>
    <language>en</language>
    <item>
      <title>The Complete Guide to Prometheus Metric Types</title>
      <dc:creator>Sunny Nazar</dc:creator>
      <pubDate>Sun, 11 Jan 2026 20:42:05 +0000</pubDate>
      <link>https://dev.to/sunnynazar/the-complete-guide-to-prometheus-metric-types-promql-alerting-and-troubleshooting-5a69</link>
      <guid>https://dev.to/sunnynazar/the-complete-guide-to-prometheus-metric-types-promql-alerting-and-troubleshooting-5a69</guid>
      <description>&lt;h2&gt;
  
  
  The Complete Guide to Prometheus Metric Types: PromQL, Alerting and Troubleshooting
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reading Time&lt;/strong&gt;: 15 minutes  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The 3 AM Call&lt;/li&gt;
&lt;li&gt;Quick Reference Card&lt;/li&gt;
&lt;li&gt;Which Metric Type Should I Use&lt;/li&gt;
&lt;li&gt;
Meet the Four Metric Types

&lt;ul&gt;
&lt;li&gt;Counter: The Tireless Bookkeeper&lt;/li&gt;
&lt;li&gt;Gauge: The Live Reporter&lt;/li&gt;
&lt;li&gt;Histogram: The Distribution Detective&lt;/li&gt;
&lt;li&gt;Summary: The Solo Performer&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Comparison Matrix&lt;/li&gt;
&lt;li&gt;PromQL Functions by Metric Type&lt;/li&gt;
&lt;li&gt;Alerting Strategies&lt;/li&gt;
&lt;li&gt;Troubleshooting Quick Reference&lt;/li&gt;
&lt;li&gt;The Cardinality Monster&lt;/li&gt;
&lt;li&gt;Best Practices&lt;/li&gt;
&lt;li&gt;References&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The 3 AM Call
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;It's 3:17 AM. Your phone buzzes violently on the nightstand.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You grab it with one eye open. PagerDuty. Of course.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"CRITICAL: API latency exceeds threshold"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You stumble to your laptop, coffee-less and bleary-eyed. Grafana loads. The dashboard is a mess of red lines spiking upward. Your mind races: &lt;em&gt;Is this a traffic spike? A memory leak? Did someone deploy something?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You stare at the metrics. &lt;code&gt;http_requests_total&lt;/code&gt; is climbing. &lt;code&gt;process_resident_memory_bytes&lt;/code&gt; looks normal. But wait... what does that histogram actually mean? Why is the p99 showing NaN? And why on earth did someone create a metric with &lt;code&gt;user_id&lt;/code&gt; as a label?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sound familiar?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This guide exists because I've been there. We've all been there. And the truth is, most Prometheus pain comes down to one thing: &lt;strong&gt;not fully understanding the four metric types.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let me introduce you to them. Think of them as four tools in your observability toolkit. Each has a job. Each has rules. Use the wrong one, and you'll be back at 3 AM wondering why your alerts are lying to you.&lt;/p&gt;

&lt;p&gt;Let's fix that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uvp2ywfg1qzs30j5mzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uvp2ywfg1qzs30j5mzo.png" alt="The Four Prometheus Metric Types" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference Card
&lt;/h2&gt;

&lt;p&gt;Need a quick answer? Start here.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric Type&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Key Function&lt;/th&gt;
&lt;th&gt;Suffix&lt;/th&gt;
&lt;th&gt;Can Aggregate?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Counter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Totals (requests, errors, bytes)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rate()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gauge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Current state (memory, CPU)&lt;/td&gt;
&lt;td&gt;Raw value&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Histogram&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Latency distributions&lt;/td&gt;
&lt;td&gt;&lt;code&gt;histogram_quantile()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;_seconds&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-instance percentiles&lt;/td&gt;
&lt;td&gt;Direct read&lt;/td&gt;
&lt;td&gt;&lt;code&gt;_seconds&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Only sum/count&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Essential Queries You'll Use Every Day&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Counter: "How many requests per second are we getting?"
rate(http_requests_total[5m])

# Gauge: "How much memory are we using right now?"
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Histogram: "What's our p99 latency across all pods?"
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Summary: "What's the average latency?" (works across instances, unlike quantiles)
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Which Metric Type Should I Use
&lt;/h2&gt;

&lt;p&gt;Before diving into the details, let me save you some time. Here's a decision flowchart that I wish someone had shown me years ago:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgi8u9bsdz5kxthzcxms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgi8u9bsdz5kxthzcxms.png" alt="Metric Type Decision Flowchart" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;The Quick Decision Table&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you would say...&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"How many X happened?"&lt;/td&gt;
&lt;td&gt;Counter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"What is the current X?"&lt;/td&gt;
&lt;td&gt;Gauge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"What's the p99 latency across all pods?"&lt;/td&gt;
&lt;td&gt;Histogram&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"What's the p99 on this specific pod?"&lt;/td&gt;
&lt;td&gt;Summary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now let me tell you the stories behind each of these tools.&lt;/p&gt;
&lt;h2&gt;
  
  
  Meet the Four Metric Types
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Counter: The Tireless Bookkeeper
&lt;/h3&gt;

&lt;p&gt;Picture a diligent accountant who sits at the entrance of your application. Every time a request comes in, she makes a tally mark. Every error? Another tally. Bytes transferred? She counts them all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Counter never forgets.&lt;/strong&gt; She never erases. Her numbers only go up. The only time they reset is when she goes home for the night (your process restarts).&lt;/p&gt;
&lt;h4&gt;
  
  
  The Counter's Personality
&lt;/h4&gt;

&lt;p&gt;A &lt;strong&gt;Counter&lt;/strong&gt; is a cumulative metric that only increases, apart from dropping back to zero when the process restarts. Think of it as the odometer in your car: you rarely care about the current reading; you care about &lt;em&gt;how fast&lt;/em&gt; it's changing.&lt;/p&gt;

&lt;p&gt;This is the crucial insight: &lt;strong&gt;raw counter values are almost useless.&lt;/strong&gt; What you want is the &lt;em&gt;rate&lt;/em&gt;.&lt;/p&gt;
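
&lt;p&gt;To make that concrete, here is a minimal Python sketch of how a rate can be derived from raw counter samples, including the reset case. It is a simplification rather than Prometheus's actual implementation: the real &lt;code&gt;rate()&lt;/code&gt; also extrapolates to the edges of the time window, and the sample values, the 15-second scrape interval, and both function names are made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def counter_increase(samples):
    """Sum the increase across consecutive samples of one counter series."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            # Normal case: the counter moved forward by the difference.
            total += curr - prev
        else:
            # The value dropped, so the process restarted and counted up from 0.
            total += curr
    return total


def per_second_rate(samples, scrape_interval_seconds):
    """Average per-second increase over the span covered by the samples."""
    window_seconds = scrape_interval_seconds * (len(samples) - 1)
    return counter_increase(samples) / window_seconds


# Hypothetical scrapes taken 15 seconds apart; the process restarts after the third.
scrapes = [100, 130, 160, 10, 40]

print(counter_increase(scrapes))     # 30 + 30 + 10 + 30 = 100.0
print(per_second_rate(scrapes, 15))  # 100 / 60, roughly 1.67 requests per second
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the raw series the last value (40) is lower than the first (100), yet the reset-aware increase shows steady traffic throughout. That is why the rate, not the raw value, is the number to watch.&lt;/p&gt;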
&lt;h4&gt;
  
  
  When to Use a Counter
&lt;/h4&gt;

&lt;p&gt;Counters thrive when tracking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total HTTP requests received&lt;/li&gt;
&lt;li&gt;Bytes sent over the network&lt;/li&gt;
&lt;li&gt;Errors encountered&lt;/li&gt;
&lt;li&gt;Background jobs completed&lt;/li&gt;
&lt;li&gt;Messages processed from a queue&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Counter Characteristics
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Direction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Only goes up (monotonically increasing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reset Behavior&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Resets to 0 when the process restarts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Typical Suffix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;_total&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Raw Value Usefulness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (always use &lt;code&gt;rate()&lt;/code&gt; or &lt;code&gt;increase()&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;
  
  
  Talking to the Counter: PromQL Patterns
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The WRONG way: Raw value tells you nothing useful
http_requests_total

# The RIGHT way: Rate of requests per second over 5 minutes
rate(http_requests_total[5m])

# Filter by label (e.g., only 500 errors)
rate(http_requests_total{status="500"}[5m])

# Total increase over the last hour
increase(http_requests_total[1h])

# Sum rates across all instances
sum(rate(http_requests_total[5m]))

# Group by HTTP method
sum by (method) (rate(http_requests_total[5m]))

# The money query: Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ 
sum(rate(http_requests_total[5m])) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Counter Alerts That Actually Work
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# "Our error rate is too high"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighErrorRate&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;sum(rate(http_requests_total{status=~"5.."}[5m])) &lt;/span&gt;
    &lt;span class="s"&gt;/ &lt;/span&gt;
    &lt;span class="s"&gt;sum(rate(http_requests_total[5m])) &amp;gt; 0.05&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exceeds&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5%"&lt;/span&gt;

&lt;span class="c1"&gt;# "Traffic dropped suddenly - possible outage"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TrafficDrop&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;sum(rate(http_requests_total[5m])) &lt;/span&gt;
    &lt;span class="s"&gt;&amp;lt; &lt;/span&gt;
    &lt;span class="s"&gt;sum(rate(http_requests_total[5m] offset 1h)) * 0.5&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Traffic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dropped&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;more&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;than&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;50%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;compared&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hour&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ago"&lt;/span&gt;

&lt;span class="c1"&gt;# "We're getting zero requests - something is very wrong"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoTraffic&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(http_requests_total[5m])) == 0 or absent(http_requests_total)&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;HTTP&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;requests&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;received&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;last&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;minutes"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Gauge: The Live Reporter
&lt;/h3&gt;

&lt;p&gt;If the Counter is an accountant tallying historical records, the &lt;strong&gt;Gauge&lt;/strong&gt; is a live news reporter telling you &lt;em&gt;what's happening right now.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;"Memory usage is at 78%!" she reports. A moment later: "It dropped to 72%!" Unlike the Counter, the Gauge's numbers go up &lt;em&gt;and&lt;/em&gt; down. She reflects the current state of the world.&lt;/p&gt;
&lt;h4&gt;
  
  
  The Gauge's Personality
&lt;/h4&gt;

&lt;p&gt;A &lt;strong&gt;Gauge&lt;/strong&gt; represents a single numerical value that can arbitrarily go up and down. It's a snapshot of reality at any moment. Think of a thermometer, a fuel gauge, or your current queue depth.&lt;/p&gt;

&lt;p&gt;The beautiful thing about gauges? &lt;strong&gt;The raw value is immediately meaningful.&lt;/strong&gt; When someone asks "How much memory are we using?", the gauge has the answer.&lt;/p&gt;
&lt;h4&gt;
  
  
  When to Use a Gauge
&lt;/h4&gt;

&lt;p&gt;Gauges excel at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current memory or CPU usage&lt;/li&gt;
&lt;li&gt;Number of active connections&lt;/li&gt;
&lt;li&gt;Queue depth&lt;/li&gt;
&lt;li&gt;Temperature readings&lt;/li&gt;
&lt;li&gt;Number of goroutines running&lt;/li&gt;
&lt;li&gt;Disk space remaining&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Gauge Characteristics
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Direction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Can increase or decrease&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reset Behavior&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not applicable (always reflects current state)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Typical Suffix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None specific&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Raw Value Usefulness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (the current value is what you want)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;
  
  
  Talking to the Gauge: PromQL Patterns
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Direct reading - totally valid and useful
node_memory_MemAvailable_bytes

# Calculate percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Average, min, max over time
avg_over_time(node_load1[1h])
max_over_time(node_load1[1h])
min_over_time(node_load1[1h])

# Predict the future: "When will we run out of disk?"
predict_linear(node_filesystem_avail_bytes[6h], 3600 * 24)

# Rate of change per second (deriv() is built for gauges; useful for capacity planning)
deriv(node_memory_MemAvailable_bytes[5m])

# Find the top consumers
topk(5, node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
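
&lt;p&gt;&lt;code&gt;predict_linear()&lt;/code&gt; fits a straight line through the samples in the range using simple linear regression and extends it into the future. The Python sketch below shows that idea under simplifying assumptions: the disk samples are invented, the function names are hypothetical, and the forecast is measured from the last sample rather than from the query evaluation time.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def fit_line(times, values):
    """Least-squares slope and intercept for value = slope * time + intercept."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    variance = sum((t - mean_t) ** 2 for t in times)
    slope = covariance / variance
    return slope, mean_v - slope * mean_t


def predict(times, values, seconds_ahead):
    """Extend the fitted line seconds_ahead past the last sample."""
    slope, intercept = fit_line(times, values)
    return slope * (times[-1] + seconds_ahead) + intercept


# Hypothetical: one sample per hour for 6 hours, losing 1 GB of free space each hour.
GB = 1_000_000_000
times = [hour * 3600 for hour in range(7)]
available = [(20 - hour) * GB for hour in range(7)]

forecast = predict(times, available, 24 * 3600)
print(forecast / GB)   # about -10: the line crosses zero well before 24 hours
print(0 > forecast)    # True, so a "predicted value below 0" alert would fire
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A negative forecast for free space is physically impossible, which is exactly why comparing the prediction against 0 works as an early warning.&lt;/p&gt;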

&lt;h4&gt;
  
  
  Gauge Alerts That Actually Work
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# "Memory is running low"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighMemoryUsage&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 &amp;gt; 90&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;90%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.instance&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;

&lt;span class="c1"&gt;# "Disk will fill up in 24 hours" - this is the kind of proactive alert that makes SREs heroes&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DiskFillingUp&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) &amp;lt; 0&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Disk&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.mountpoint&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;will&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fill&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;within&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;24&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hours"&lt;/span&gt;

&lt;span class="c1"&gt;# "Connection pool is almost exhausted"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConnectionPoolNearExhaustion&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db_pool_active_connections / db_pool_max_connections &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.8&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connection&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pool&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;over&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;80%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;utilized"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Histogram: The Distribution Detective
&lt;/h3&gt;

&lt;p&gt;Now we get to the interesting ones. The &lt;strong&gt;Histogram&lt;/strong&gt; is a detective who doesn't just count crimes; she categorizes them by severity and gives you the full picture.&lt;/p&gt;

&lt;p&gt;"Out of 1000 requests," she reports, "150 completed within 100ms, 700 within 500ms, and 950 within 1 second. The remaining 50 took longer."&lt;/p&gt;

&lt;p&gt;This is the power of the Histogram. It doesn't just tell you the average. It shows you the &lt;em&gt;distribution&lt;/em&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  When to Use a Histogram
&lt;/h4&gt;

&lt;p&gt;Histograms are perfect for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request latency (how long did API calls take?)&lt;/li&gt;
&lt;li&gt;Response sizes&lt;/li&gt;
&lt;li&gt;Any measurement where you need percentiles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When you need to aggregate percentiles across multiple pods&lt;/strong&gt; (this is the killer feature)&lt;/li&gt;
&lt;/ul&gt;
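
&lt;p&gt;Why does aggregation matter so much? Percentiles computed separately on each pod cannot be averaged into a fleet-wide percentile, but bucket counts are plain counters that add up. A small Python sketch with made-up latencies for two hypothetical pods (the bucket boundaries and helper names are also assumptions) illustrates the difference:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math


def nearest_rank_percentile(values, q):
    """Exact percentile of raw observations using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def cumulative_buckets(values, bounds):
    """Histogram-style counts: observations at or below each upper bound."""
    return [sum(1 for v in values if bound >= v) for bound in bounds]


# Hypothetical latencies in seconds: pod A is healthy, pod B has a slow tail.
pod_a = [0.05] * 900
pod_b = [0.05] * 80 + [2.0] * 20

# Averaging per-pod percentiles ignores how much traffic each pod served.
averaged_p99 = (nearest_rank_percentile(pod_a, 0.99)
                + nearest_rank_percentile(pod_b, 0.99)) / 2
fleet_p99 = nearest_rank_percentile(pod_a + pod_b, 0.99)
print(averaged_p99, fleet_p99)   # roughly 1.025 versus 2.0

# Bucket counts, by contrast, can simply be summed pod by pod.
BOUNDS = [0.1, 0.5, 1.0, math.inf]
summed = [a + b for a, b in zip(cumulative_buckets(pod_a, BOUNDS),
                                cumulative_buckets(pod_b, BOUNDS))]
print(summed == cumulative_buckets(pod_a + pod_b, BOUNDS))   # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Summing buckets is what &lt;code&gt;sum by (le) (rate(..._bucket[5m]))&lt;/code&gt; does in PromQL before &lt;code&gt;histogram_quantile()&lt;/code&gt; runs on the combined result.&lt;/p&gt;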

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft34bwzpm68gkxppq8id2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft34bwzpm68gkxppq8id2.png" alt="How Histogram Buckets Work" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Histogram Characteristics
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Components&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Three kinds of time series: &lt;code&gt;_bucket&lt;/code&gt; (one per bucket boundary), &lt;code&gt;_sum&lt;/code&gt;, &lt;code&gt;_count&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aggregation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully aggregatable across instances (this is huge!)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bucket boundaries must be defined upfront&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Typical Suffix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;_seconds&lt;/code&gt;, &lt;code&gt;_bytes&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;
  
  
  The Histogram's Secret: Buckets
&lt;/h4&gt;

&lt;p&gt;Here's what a histogram actually creates behind the scenes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http_request_duration_seconds_bucket{le="0.1"}   --&amp;gt; 150 requests were &amp;lt;= 100ms
http_request_duration_seconds_bucket{le="0.5"}   --&amp;gt; 700 requests were &amp;lt;= 500ms
http_request_duration_seconds_bucket{le="1"}     --&amp;gt; 950 requests were &amp;lt;= 1s
http_request_duration_seconds_bucket{le="+Inf"}  --&amp;gt; 1000 requests total
http_request_duration_seconds_sum                --&amp;gt; Total time spent (e.g., 423.7 seconds)
http_request_duration_seconds_count              --&amp;gt; Total count (1000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



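&lt;p&gt;The &lt;code&gt;le&lt;/code&gt; label means "less than or equal to." Buckets are cumulative: each one counts every observation at or below its boundary, so the 150 requests in &lt;code&gt;le="0.1"&lt;/code&gt; are also included in &lt;code&gt;le="0.5"&lt;/code&gt;, and &lt;code&gt;le="+Inf"&lt;/code&gt; always equals &lt;code&gt;_count&lt;/code&gt;.&lt;/p&gt;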
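
&lt;p&gt;The bucket layout also explains how &lt;code&gt;histogram_quantile()&lt;/code&gt; arrives at a number: it finds the bucket that contains the requested rank and interpolates linearly inside it. The following Python sketch reproduces that idea with the example counts above. It is a simplified illustration, not Prometheus's code: it works on raw counts instead of rates, assumes a quantile greater than 0 and at most 1 on a non-empty histogram, and the function names are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

# (upper bound in seconds, cumulative count), taken from the example above.
BUCKETS = [(0.1, 150), (0.5, 700), (1.0, 950), (math.inf, 1000)]


def per_bucket_counts(buckets):
    """Undo the cumulative counting to see how many requests fell in each range."""
    counts = []
    previous = 0
    for upper, cumulative in buckets:
        counts.append((upper, cumulative - previous))
        previous = cumulative
    return counts


def estimate_quantile(q, buckets):
    """Estimate a quantile by linear interpolation inside the matching bucket."""
    rank = q * buckets[-1][1]          # position of the requested observation
    lower, below = 0.0, 0              # the lowest bucket is assumed to start at 0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper):
                # Nothing to interpolate towards: fall back to the last finite bound.
                return lower
            return lower + (upper - lower) * (rank - below) / (cumulative - below)
        lower, below = upper, cumulative
    return lower


print(per_bucket_counts(BUCKETS))        # [(0.1, 150), (0.5, 550), (1.0, 250), (inf, 50)]
print(estimate_quantile(0.5, BUCKETS))   # roughly 0.355 seconds
print(estimate_quantile(0.99, BUCKETS))  # 1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note the p99 result: the 990th request lands in the &lt;code&gt;+Inf&lt;/code&gt; bucket, so the estimate is capped at the highest finite boundary (1 second). If tail latency matters to you, make sure some bucket boundaries sit above it.&lt;/p&gt;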

&lt;h4&gt;
  
  
  Talking to the Histogram: PromQL Patterns
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Calculate the 50th percentile (median)
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# Calculate p99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# P99 latency per endpoint (aggregated correctly!)
histogram_quantile(0.99, 
  sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
)

# Average request duration (simpler alternative)
rate(http_request_duration_seconds_sum[5m]) 
/ 
rate(http_request_duration_seconds_count[5m])

# "What percentage of requests complete within 500ms?" (Apdex-style; needs a bucket at le="0.5")
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m])) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Histogram Alerts That Actually Work
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# "P99 latency is too high"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighP99Latency&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;histogram_quantile(0.99, &lt;/span&gt;
      &lt;span class="s"&gt;sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))&lt;/span&gt;
    &lt;span class="s"&gt;) &amp;gt; 2&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exceeds&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;seconds&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.service&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;

&lt;span class="c1"&gt;# "Latency doubled compared to an hour ago"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LatencyDegradation&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))&lt;/span&gt;
    &lt;span class="s"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m] offset 1h))) * 2&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P95&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2x&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;higher&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;than&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hour&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ago"&lt;/span&gt;

&lt;span class="c1"&gt;# SLO violation: "Less than 99% of requests are fast"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SLOViolation&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30m]))&lt;/span&gt;
    &lt;span class="s"&gt;/&lt;/span&gt;
    &lt;span class="s"&gt;sum(rate(http_request_duration_seconds_count[30m])) &amp;lt; 0.99&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SLO&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Violation:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Less&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;than&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;99%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;requests&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;complete&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;within&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;500ms"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Summary: The Solo Performer
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Summary&lt;/strong&gt; is the Histogram's cousin. She can also give you percentiles, but with one crucial difference: she calculates them herself, on the client side.&lt;/p&gt;

&lt;p&gt;This makes her fast and precise for a single instance. But here's the catch: &lt;strong&gt;she can't collaborate.&lt;/strong&gt; If you have 10 pods running, you cannot simply combine their percentiles to get a global percentile. Averaging p99s does not give you the true p99. It's mathematically wrong.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;The Summary Trap&lt;/strong&gt;: I've seen teams spend hours debugging "wrong" percentiles, only to discover they were accidentally averaging Summary quantiles across instances. Don't be that team. If you need to aggregate, use Histograms.&lt;/p&gt;
&lt;/blockquote&gt;
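
&lt;p&gt;If the math feels abstract, a tiny simulation makes it concrete. This is only a sketch with made-up numbers: the pod count, the latency ranges, and the &lt;code&gt;p99&lt;/code&gt; helper are all assumptions for illustration, not anything Prometheus ships.&lt;/p&gt;

```python
import random
import statistics


def p99(samples):
    """Return the 99th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    # Nearest rank: the value below which 99% of observations fall.
    rank = max(1, round(0.99 * len(ordered)))
    return ordered[rank - 1]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical fleet: nine healthy pods and one slow pod (latencies in seconds).
healthy_pods = [[rng.uniform(0.05, 0.20) for _ in range(1000)] for _ in range(9)]
slow_pod = [rng.uniform(1.0, 3.0) for _ in range(1000)]
pods = healthy_pods + [slow_pod]

# Roughly what avg(...{quantile="0.99"}) would compute across instances.
averaged_p99 = statistics.mean(p99(pod) for pod in pods)

# The real fleet-wide p99, computed over every single observation.
true_p99 = p99([latency for pod in pods for latency in pod])

print(f"average of per-pod p99s: {averaged_p99:.2f}s")
print(f"true global p99:         {true_p99:.2f}s")
```

&lt;p&gt;With these assumed numbers the averaged value lands well under one second, while the true global p99 sits above 2.5 seconds, because the slow pod alone owns the worst 10% of all requests. The average quietly hides it.&lt;/p&gt;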

&lt;h4&gt;
  
  
  When to Use a Summary
&lt;/h4&gt;

&lt;p&gt;Summaries are appropriate when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You genuinely only care about a single instance&lt;/li&gt;
&lt;li&gt;You don't know bucket boundaries ahead of time&lt;/li&gt;
&lt;li&gt;You're maintaining legacy code (most new projects should use Histograms)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Summary Characteristics
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Components&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-calculated quantiles, plus &lt;code&gt;_sum&lt;/code&gt; and &lt;code&gt;_count&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aggregation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cannot aggregate quantiles (only sum/count)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Percentile Calculation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Done on the client side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Typical Suffix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;_seconds&lt;/code&gt;, &lt;code&gt;_bytes&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Talking to the Summary: PromQL Patterns
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Read quantiles directly (only meaningful per-instance)
http_request_duration_seconds{quantile="0.99"}

# Average latency - this DOES work across instances!
sum(rate(http_request_duration_seconds_sum[5m])) 
/ 
sum(rate(http_request_duration_seconds_count[5m]))

# DON'T DO THIS - averaging quantiles is mathematically wrong
# avg(http_request_duration_seconds{quantile="0.99"})

# If you must look at quantiles, do it per-instance
http_request_duration_seconds{quantile="0.99", instance="pod-1:8080"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Comparison Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Counter&lt;/th&gt;
&lt;th&gt;Gauge&lt;/th&gt;
&lt;th&gt;Histogram&lt;/th&gt;
&lt;th&gt;Summary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Direction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Only up ⬆️&lt;/td&gt;
&lt;td&gt;Up and down ↕️&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Raw value useful&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use rate()&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Required&lt;/td&gt;
&lt;td&gt;No (use &lt;code&gt;delta()&lt;/code&gt; or &lt;code&gt;deriv()&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;On buckets&lt;/td&gt;
&lt;td&gt;On sum/count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aggregatable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;⚠️ Only sum/count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Percentiles&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Server-side&lt;/td&gt;
&lt;td&gt;✅ Client-side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  PromQL Functions by Metric Type
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Counter&lt;/th&gt;
&lt;th&gt;Gauge&lt;/th&gt;
&lt;th&gt;Histogram&lt;/th&gt;
&lt;th&gt;Summary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rate()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Primary&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ On buckets&lt;/td&gt;
&lt;td&gt;✅ On sum/count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;irate()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;increase()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deriv()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;delta()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;predict_linear()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;histogram_quantile()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Required&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Alerting Strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Golden Signals
&lt;/h3&gt;

&lt;p&gt;Google's SRE book teaches us to monitor four golden signals: latency, traffic, errors, and saturation. Here's how metric types map to them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1. LATENCY (Histogram) - "How long do things take?"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighLatency&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;

&lt;span class="c1"&gt;# 2. TRAFFIC (Counter) - "How much are we doing?"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TrafficAnomaly&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;abs(sum(rate(http_requests_total[5m])) - sum(rate(http_requests_total[5m] offset 1w)))&lt;/span&gt;
    &lt;span class="s"&gt;/ sum(rate(http_requests_total[5m] offset 1w)) &amp;gt; 0.5&lt;/span&gt;

&lt;span class="c1"&gt;# 3. ERRORS (Counter) - "How often do things fail?"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighErrorRate&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.01&lt;/span&gt;

&lt;span class="c1"&gt;# 4. SATURATION (usually a Gauge; CPU time is a Counter, so it needs rate()) - "How full is our system?"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighSaturation&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;avg by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.9&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
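
&lt;p&gt;The saturation rule is the least obvious of the four, because &lt;code&gt;node_cpu_seconds_total&lt;/code&gt; is a counter of seconds spent in each mode. Here is a rough Python sketch of the arithmetic behind &lt;code&gt;1 - rate(...{mode="idle"})&lt;/code&gt;. The function name and the sample readings are hypothetical, and real &lt;code&gt;rate()&lt;/code&gt; also handles resets and extrapolation, which this skips.&lt;/p&gt;

```python
def cpu_utilization(idle_start, idle_end, elapsed_seconds):
    """Busy fraction of one CPU core from two readings of its idle-seconds counter."""
    idle_rate = (idle_end - idle_start) / elapsed_seconds  # idle seconds per second
    return 1 - idle_rate


# Hypothetical readings taken 300 seconds (5 minutes) apart for two cores.
cores = [(1000.0, 1030.0), (2000.0, 2015.0)]
per_core = [cpu_utilization(start, end, 300) for start, end in cores]

# avg by (instance) averages the per-core values into one number per host.
instance_utilization = sum(per_core) / len(per_core)

print(per_core)
print(instance_utilization > 0.9)  # True, so the alert condition holds
```

&lt;p&gt;A core that was idle for 30 of 300 seconds is 90% busy; averaging the cores gives the per-instance figure the alert compares against 0.9.&lt;/p&gt;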



&lt;h3&gt;
  
  
  SLO-Based Multi-Burn Rate Alerts
&lt;/h3&gt;

&lt;p&gt;For the more advanced: burn rate alerts that catch both fast and slow burns of your error budget.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fast burn: 2% of monthly error budget consumed in 1 hour&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SLOFastBurn&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;(sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) &amp;gt; 14.4 * 0.001)&lt;/span&gt;
    &lt;span class="s"&gt;and&lt;/span&gt;
    &lt;span class="s"&gt;(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) &amp;gt; 14.4 * 0.001)&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;

&lt;span class="c1"&gt;# Slow burn: Steady consumption over days&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SLOSlowBurn&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;(sum(rate(http_requests_total{status=~"5.."}[6h])) / sum(rate(http_requests_total[6h])) &amp;gt; 1 * 0.001)&lt;/span&gt;
    &lt;span class="s"&gt;and&lt;/span&gt;
    &lt;span class="s"&gt;(sum(rate(http_requests_total{status=~"5.."}[3h])) / sum(rate(http_requests_total[3h])) &amp;gt; 1 * 0.001)&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
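
&lt;p&gt;Where do numbers like &lt;code&gt;14.4 * 0.001&lt;/code&gt; come from? The sketch below works through the arithmetic, assuming a 99.9% SLO measured over 30 days (which is what the &lt;code&gt;0.001&lt;/code&gt; in the rules implies). The function names are made up for this example.&lt;/p&gt;

```python
import math

SLO_TARGET = 0.999          # assumed 99.9% SLO, matching the 0.001 in the rules
SLO_PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def error_rate_threshold(burn_rate, slo_target=SLO_TARGET):
    """Error ratio that corresponds to a given burn rate."""
    return burn_rate * (1 - slo_target)


def budget_consumed(burn_rate, window_hours, period_hours=SLO_PERIOD_HOURS):
    """Fraction of the error budget spent if the burn rate holds for the whole window."""
    return burn_rate * window_hours / period_hours


print(f"{error_rate_threshold(14.4):.4f}")  # 0.0144
print(f"{budget_consumed(14.4, 1):.1%}")    # 2.0%
print(f"{budget_consumed(1, 6):.2%}")       # 0.83%
```

&lt;p&gt;A burn rate of 14.4 sustained for one hour eats 2% of a 30-day budget, which is why the fast-burn rule pages. A burn rate of 1 means you are spending the budget exactly as fast as the SLO allows.&lt;/p&gt;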



&lt;h2&gt;
  
  
  Troubleshooting Quick Reference
&lt;/h2&gt;

&lt;p&gt;When things go wrong at 3 AM, use these tables:&lt;/p&gt;

&lt;h3&gt;
  
  
  General Issues (All Metric Types)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;th&gt;Debug Query&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No data at all&lt;/td&gt;
&lt;td&gt;Target not scraped&lt;/td&gt;
&lt;td&gt;Check target status&lt;/td&gt;
&lt;td&gt;&lt;code&gt;up{job="my-service"}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gaps in graph&lt;/td&gt;
&lt;td&gt;Scrape failures&lt;/td&gt;
&lt;td&gt;Check scrape duration&lt;/td&gt;
&lt;td&gt;&lt;code&gt;scrape_duration_seconds{job="..."}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Too many series&lt;/td&gt;
&lt;td&gt;High cardinality&lt;/td&gt;
&lt;td&gt;Drop or bound high-cardinality labels&lt;/td&gt;
&lt;td&gt;&lt;code&gt;topk(10, count by (__name__)({__name__!=""}))&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Counter Issues
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Flat line&lt;/td&gt;
&lt;td&gt;No events occurring&lt;/td&gt;
&lt;td&gt;Check application logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sudden drops&lt;/td&gt;
&lt;td&gt;Counter reset&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;rate()&lt;/code&gt; (it handles resets)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Negative rate&lt;/td&gt;
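&lt;td&gt;&lt;code&gt;delta()&lt;/code&gt; or &lt;code&gt;deriv()&lt;/code&gt; applied to a counter&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;rate()&lt;/code&gt; or &lt;code&gt;increase()&lt;/code&gt;, which treat drops as resets&lt;/td&gt;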
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
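
&lt;p&gt;What does "handles resets" actually mean? Here is a simplified Python sketch of the idea. It is not the real Prometheus implementation (which also extrapolates to the edges of the window), and the function name and sample values are invented for illustration.&lt;/p&gt;

```python
def increase_with_resets(samples):
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if prev > cur:
            # The counter restarted from zero, so everything counted
            # since the restart (cur) is new growth.
            total += cur
        else:
            total += cur - prev
    return total


# Hypothetical scraped values: the process restarts after the third sample.
samples = [100, 130, 160, 5, 35]

print(increase_with_resets(samples))  # 95.0
print(samples[-1] - samples[0])       # -65, the naive last-minus-first answer
```

&lt;p&gt;The naive subtraction goes negative after a restart, while the reset-aware version keeps counting. That is the behaviour you get for free from &lt;code&gt;rate()&lt;/code&gt; and &lt;code&gt;increase()&lt;/code&gt;.&lt;/p&gt;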

&lt;h3&gt;
  
  
  Gauge Issues
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Value unchanged&lt;/td&gt;
&lt;td&gt;Stale metric&lt;/td&gt;
&lt;td&gt;Check scrape status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Noisy graph&lt;/td&gt;
&lt;td&gt;High variance&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;avg_over_time()&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wrong scale&lt;/td&gt;
&lt;td&gt;Unit mismatch&lt;/td&gt;
&lt;td&gt;Check metric units&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Histogram Issues
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Wrong percentile&lt;/td&gt;
&lt;td&gt;Bad bucket boundaries&lt;/td&gt;
&lt;td&gt;Add more buckets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Most values in +Inf&lt;/td&gt;
&lt;td&gt;Buckets too small&lt;/td&gt;
&lt;td&gt;Increase upper bounds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NaN result&lt;/td&gt;
&lt;td&gt;No samples&lt;/td&gt;
&lt;td&gt;Increase time window&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
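
&lt;p&gt;Why do bucket boundaries matter so much? Because the quantile is estimated by interpolating linearly inside whichever bucket the target rank falls into. The Python sketch below is a simplified take on that idea, not the actual Prometheus code: it assumes positive bounds sorted ascending, a final &lt;code&gt;+Inf&lt;/code&gt; bucket, at least one observation, and a quantile greater than zero. The bucket layouts are hypothetical.&lt;/p&gt;

```python
from bisect import bisect_left


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    bounds = [bound for bound, _ in buckets]
    counts = [count for _, count in buckets]
    rank = q * counts[-1]
    # First bucket whose cumulative count reaches the target rank.
    idx = bisect_left(counts, rank)
    if idx == len(buckets) - 1:
        # Rank falls in +Inf: the best available answer is the highest finite bound.
        return bounds[-2]
    lower = bounds[idx - 1] if idx else 0.0
    below = counts[idx - 1] if idx else 0
    # Assume observations are spread evenly inside the bucket.
    return lower + (bounds[idx] - lower) * (rank - below) / (counts[idx] - below)


# Same 1000 observations, two hypothetical bucket layouts.
coarse = [(0.1, 500), (1.0, 990), (10.0, 1000), (float("inf"), 1000)]
fine = [(0.1, 500), (1.0, 990), (2.0, 999), (10.0, 1000), (float("inf"), 1000)]

print(estimate_quantile(0.995, coarse))  # about 5.5 seconds
print(estimate_quantile(0.995, fine))    # about 1.56 seconds
```

&lt;p&gt;Nothing about the traffic changed between the two calls. Adding a single boundary at 2 seconds moved the estimate from roughly 5.5s to roughly 1.56s, which is why buckets should be dense around the latencies you care about.&lt;/p&gt;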

&lt;h3&gt;
  
  
  Summary Issues
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Wrong global p99&lt;/td&gt;
&lt;td&gt;Averaged quantiles&lt;/td&gt;
&lt;td&gt;Switch to Histogram&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Cardinality Monster
&lt;/h2&gt;

&lt;p&gt;Let me tell you about the monster that has brought down more Prometheus instances than any other: &lt;strong&gt;cardinality.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cardinality is the number of unique time series in your system. And it can explode faster than you think.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh01xik4svrk8zjbq0el.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh01xik4svrk8zjbq0el.png" alt="Cardinality Explosion" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Cardinality Explodes
&lt;/h3&gt;

&lt;p&gt;Every unique combination of labels creates a new time series:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 metric × 5 methods × 10 status codes × 100 endpoints × 50 instances
= 250,000 time series from ONE metric
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
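
&lt;p&gt;You can run the same back-of-the-envelope estimate for your own metrics before shipping them. A minimal Python sketch, where the label names and the bucket count are assumptions for illustration:&lt;/p&gt;

```python
import math

# Hypothetical label cardinalities, matching the arithmetic above.
label_values = {"method": 5, "status": 10, "endpoint": 100, "instance": 50}

series_per_metric = math.prod(label_values.values())
print(series_per_metric)  # 250000

# A histogram multiplies this again: one series per bucket (including +Inf),
# plus the _sum and _count series.
bucket_count = 12  # assumed number of buckets, including +Inf
histogram_series = series_per_metric * (bucket_count + 2)
print(histogram_series)  # 3500000
```

&lt;p&gt;This is a worst case, since not every label combination shows up in practice, but it is the right number to be scared of.&lt;/p&gt;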



&lt;h3&gt;
  
  
  Labels That Will Destroy Your Prometheus
&lt;/h3&gt;

&lt;p&gt;Never use these as labels:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Label Type&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Why It's Bad&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User IDs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;user_id="12345"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Millions of values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Request IDs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;request_id="abc-123"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;One per request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timestamps&lt;/td&gt;
&lt;td&gt;&lt;code&gt;timestamp="2024-01-01"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Infinite growth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IP addresses&lt;/td&gt;
&lt;td&gt;&lt;code&gt;client_ip="192.168.1.1"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Thousands of values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session tokens&lt;/td&gt;
&lt;td&gt;&lt;code&gt;session="..."&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;One per session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error messages&lt;/td&gt;
&lt;td&gt;&lt;code&gt;error="Connection refused..."&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unbounded strings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
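
&lt;p&gt;IDs often sneak into labels through URL paths. One common defence is to normalize the path before using it as a label value. The sketch below is one possible approach, not a standard API: the &lt;code&gt;normalize_endpoint&lt;/code&gt; helper, the &lt;code&gt;:id&lt;/code&gt; placeholder, and the sample paths are all hypothetical.&lt;/p&gt;

```python
import re

# Path segments that look like identifiers: all digits, or a UUID.
_ID_SEGMENT = re.compile(
    r"/(?:\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?=/|$)"
)


def normalize_endpoint(path):
    """Replace identifier-like path segments with a fixed placeholder."""
    return _ID_SEGMENT.sub("/:id", path)


print(normalize_endpoint("/users/12345/orders/987"))  # /users/:id/orders/:id
print(normalize_endpoint("/health"))                  # /health
```

&lt;p&gt;With this in place the &lt;code&gt;endpoint&lt;/code&gt; label grows with the number of routes in your application, not with the number of users or orders.&lt;/p&gt;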

&lt;h3&gt;
  
  
  Detecting the Monster
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# How bad is it? Count all series.
count({__name__!=""})

# Find the offenders
topk(10, count by (__name__) ({__name__!=""}))

# Check per-label cardinality
count by (endpoint) (http_requests_total)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cardinality Guidelines
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Series Count (per metric)&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🟢 Low&lt;/td&gt;
&lt;td&gt;Under 1,000&lt;/td&gt;
&lt;td&gt;You're fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟡 Moderate&lt;/td&gt;
&lt;td&gt;1K - 10K&lt;/td&gt;
&lt;td&gt;Monitor it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟠 High&lt;/td&gt;
&lt;td&gt;10K - 100K&lt;/td&gt;
&lt;td&gt;Investigate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔴 Critical&lt;/td&gt;
&lt;td&gt;Over 100K&lt;/td&gt;
&lt;td&gt;Fix immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Do These Things
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Always use &lt;code&gt;rate()&lt;/code&gt; with counters&lt;/strong&gt; - Raw values are useless&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set rate window to at least 4x the scrape interval&lt;/strong&gt; - Ensures enough data points even when a scrape is missed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Include &lt;code&gt;le&lt;/code&gt; in your &lt;code&gt;by&lt;/code&gt; clause&lt;/strong&gt; before &lt;code&gt;histogram_quantile()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use histograms for percentiles&lt;/strong&gt; - They aggregate correctly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add &lt;code&gt;for&lt;/code&gt; duration to alerts&lt;/strong&gt; - Prevents flapping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define bucket boundaries based on SLOs&lt;/strong&gt; - Know what matters&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Avoid These Mistakes
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Averaging summary quantiles&lt;/strong&gt; - Mathematically wrong&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using &lt;code&gt;irate()&lt;/code&gt; for alerting&lt;/strong&gt; - Too volatile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting on raw gauge spikes&lt;/strong&gt; - Use &lt;code&gt;for&lt;/code&gt; duration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High cardinality labels&lt;/strong&gt; - They'll kill your Prometheus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;avg_over_time(rate(...))&lt;/code&gt;&lt;/strong&gt; - Just use a larger rate window&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/concepts/metric_types/" rel="noopener noreferrer"&gt;Prometheus Official Documentation: Metric Types&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/prometheus/latest/querying/basics/" rel="noopener noreferrer"&gt;Prometheus Official Documentation: Querying Basics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/prometheus/latest/querying/functions/" rel="noopener noreferrer"&gt;Prometheus Official Documentation: Querying Functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/" rel="noopener noreferrer"&gt;Prometheus Official Documentation: Alerting Rules&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sre.google/sre-book/monitoring-distributed-systems/" rel="noopener noreferrer"&gt;Google SRE Book: Monitoring Distributed Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sre.google/workbook/alerting-on-slos/" rel="noopener noreferrer"&gt;Google SRE Workbook: Alerting on SLOs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.robustperception.io/how-does-a-prometheus-histogram-work" rel="noopener noreferrer"&gt;Robust Perception: How does a Prometheus Histogram work?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.robustperception.io/how-does-a-prometheus-summary-work" rel="noopener noreferrer"&gt;Robust Perception: How does a Prometheus Summary work?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dash0.com/knowledge/prometheus-metrics" rel="noopener noreferrer"&gt;Dash0: Understanding the Prometheus Metric Types&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://betterstack.com/community/guides/monitoring/prometheus-metrics-explained/" rel="noopener noreferrer"&gt;Better Stack: Prometheus Metrics Explained&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So here we are. It's 4:15 AM, but you're no longer panicking.&lt;/p&gt;

&lt;p&gt;You know that the &lt;strong&gt;Counter&lt;/strong&gt; is your reliable bookkeeper, always tallying, never forgetting. You query her with &lt;code&gt;rate()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You know that the &lt;strong&gt;Gauge&lt;/strong&gt; is your live reporter, giving you the current state. Her raw values make sense.&lt;/p&gt;

&lt;p&gt;You know that the &lt;strong&gt;Histogram&lt;/strong&gt; is your distribution detective, revealing the patterns in your latency. She aggregates correctly across all your pods.&lt;/p&gt;

&lt;p&gt;And you know to be careful with the &lt;strong&gt;Summary&lt;/strong&gt;, the solo performer who can't collaborate across instances.&lt;/p&gt;

&lt;p&gt;Most importantly, you've learned to respect the &lt;strong&gt;Cardinality Monster&lt;/strong&gt; and keep him caged.&lt;/p&gt;

&lt;p&gt;The pager may buzz again. But next time, you'll know exactly what you're looking at.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Now go get some sleep. You've earned it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>Platform Engineering Principles</title>
      <dc:creator>Sunny Nazar</dc:creator>
      <pubDate>Thu, 29 May 2025 14:41:06 +0000</pubDate>
      <link>https://dev.to/sunnynazar/platform-engineering-principles-2kfe</link>
      <guid>https://dev.to/sunnynazar/platform-engineering-principles-2kfe</guid>
      <description>&lt;p&gt;In today’s fast-paced, cloud-native world, Platform Engineering has emerged as a critical discipline for delivering &lt;strong&gt;self-service, scalable, reliable, secure, cost-optimized, and efficient software delivery platforms&lt;/strong&gt;. Whether you’re building internal developer platforms, shared infrastructure, or enabling DevOps practices, a well-designed platform has become the backbone of modern engineering organizations.&lt;/p&gt;

&lt;p&gt;But what exactly makes a platform successful? What are the core principles to keep in mind when building such platforms? And what are the common pitfalls to avoid?&lt;/p&gt;

&lt;p&gt;Let’s dive deep into the fundamental principles that underpin effective and sustainable platform engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Principles of Platform Engineering
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
Developer Experience (DX)
&lt;/li&gt;
&lt;li&gt;
Security and Compliance
&lt;/li&gt;
&lt;li&gt;
Multi-Tenancy and Isolation
&lt;/li&gt;
&lt;li&gt;
Observability and Transparency
&lt;/li&gt;
&lt;li&gt;
Automation and Self-Healing
&lt;/li&gt;
&lt;li&gt;
Scalability and Reliability
&lt;/li&gt;
&lt;li&gt;
Standards and Governance
&lt;/li&gt;
&lt;li&gt;
Cost Awareness
&lt;/li&gt;
&lt;li&gt;
Feedback Loops and Continuous Improvement
&lt;/li&gt;
&lt;li&gt;
Modularity and Extensibility
&lt;/li&gt;
&lt;li&gt;Documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common Challenges and Bonus Points
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
Common Challenges in Platform Engineering
&lt;/li&gt;
&lt;li&gt;
Bonus Points &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Core Principles of Platform Engineering
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Developer Experience (DX)
&lt;/h2&gt;

&lt;p&gt;A platform exists to empower developers. The best platforms reduce friction, simplify workflows, and enable teams to deliver faster and more reliably. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Self-service capabilities (e.g., provisioning, deployments).&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Golden paths that provide standardized, pre-approved templates.&lt;/li&gt;
&lt;li&gt;Clear and accessible documentation.&lt;/li&gt;
&lt;li&gt;Intuitive UIs and APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy developers are productive developers. Prioritize their experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security and Compliance
&lt;/h2&gt;

&lt;p&gt;Security should not be an afterthought. Platforms must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enforce least-privilege access and zero-trust principles&lt;/strong&gt; (e.g., SSO, SCP).&lt;/li&gt;
&lt;li&gt;Automate policy checks (e.g., OPA/Gatekeeper, Kyverno).&lt;/li&gt;
&lt;li&gt;Secure secrets management (e.g., HashiCorp Vault, AWS Secrets Manager).&lt;/li&gt;
&lt;li&gt;Maintain audit trails and compliance logs (e.g., AWS CloudTrail).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By baking security into the platform itself, you minimize risks and simplify compliance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Tenancy and Isolation
&lt;/h2&gt;

&lt;p&gt;When multiple teams or products share a platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Isolate workloads with namespaces, network policies, resource quotas, and network segmentation (VPC/VNet)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Implement tenant-specific RBAC.&lt;/li&gt;
&lt;li&gt;Ensure fair usage to prevent noisy neighbor issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tenant isolation is critical for security and performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability and Transparency
&lt;/h2&gt;

&lt;p&gt;A platform must be transparent in its operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Centralized logging, metrics, and tracing&lt;/strong&gt; (e.g., Prometheus, Grafana, Loki, OpenTelemetry).&lt;/li&gt;
&lt;li&gt;Dashboards for both platform engineers and end users.&lt;/li&gt;
&lt;li&gt;Real-time alerts and root cause analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Observability helps diagnose issues quickly, keeps everyone informed, and makes the platform's behavior transparent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automation and Self-Healing
&lt;/h2&gt;

&lt;p&gt;Reduce manual toil by automating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure provisioning (e.g., Terraform)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;CI/CD pipelines (e.g., GitHub Actions, ArgoCD).&lt;/li&gt;
&lt;li&gt;Remediation of failures, scaling, and resource management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Platforms should be self-healing and resilient by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scalability and Reliability
&lt;/h2&gt;

&lt;p&gt;A platform must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale horizontally as demand grows&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Handle failures gracefully with retries, circuit breakers, and failovers.&lt;/li&gt;
&lt;li&gt;Provide service level objectives (SLOs) and error budgets to manage reliability.&lt;/li&gt;
&lt;/ul&gt;
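
&lt;p&gt;An error budget is just the SLO turned around: the amount of unreliability you are allowed to spend. As a rough Python sketch, assuming an availability SLO measured over a 30-day period (the function names and figures here are illustrative only):&lt;/p&gt;

```python
def error_budget_minutes(slo_target, period_days=30):
    """Minutes of allowed unavailability for an availability SLO over a period."""
    return (1 - slo_target) * period_days * 24 * 60


def budget_remaining(slo_target, downtime_minutes, period_days=30):
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    budget = error_budget_minutes(slo_target, period_days)
    return 1 - downtime_minutes / budget


print(round(error_budget_minutes(0.999), 1))    # 43.2
print(round(budget_remaining(0.999, 10.8), 2))  # 0.75
```

&lt;p&gt;A 99.9% target leaves about 43 minutes of downtime per 30 days. Tracking how much of that remains gives teams a shared, objective signal for when to slow down releases.&lt;/p&gt;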

&lt;p&gt;Reliable platforms build trust, and scalable platforms can grow with the organization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standards and Governance
&lt;/h2&gt;

&lt;p&gt;Consistency accelerates delivery by defining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Golden paths with approved tech stacks and best practices.&lt;/li&gt;
&lt;li&gt;Reuse of existing solutions and industry-standard tools wherever possible. &lt;strong&gt;Avoid reinventing the wheel&lt;/strong&gt;: it’s often more efficient to adopt battle-tested patterns and tools than to build everything from scratch.&lt;/li&gt;
&lt;li&gt;Code and configuration linting and validation.&lt;/li&gt;
&lt;li&gt;Governance policies and automated enforcement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By providing paved roads, teams can focus on innovation, not reinvention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Awareness
&lt;/h2&gt;

&lt;p&gt;Platform costs can spiral out of control. Adopt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resource optimization and rightsizing&lt;/strong&gt; (e.g., Karpenter).&lt;/li&gt;
&lt;li&gt;Cost visibility dashboards.&lt;/li&gt;
&lt;li&gt;Fair usage policies and showback/chargeback models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost-efficient platforms are sustainable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feedback Loops and Continuous Improvement
&lt;/h2&gt;

&lt;p&gt;Listen to your users and developers, and iterate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Gather feedback via surveys, interviews, or support tickets.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Measure adoption, usage, and friction points.&lt;/li&gt;
&lt;li&gt;Prioritize enhancements and bug fixes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A great platform evolves with its users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modularity and Extensibility
&lt;/h2&gt;

&lt;p&gt;Avoid monoliths. Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build modular, loosely coupled components.&lt;/li&gt;
&lt;li&gt;Adopt an &lt;strong&gt;API-first approach&lt;/strong&gt;: Design your platform’s capabilities as well-defined APIs that can be consumed by both internal and external systems. This promotes reusability, integration, and flexibility. APIs should be well-documented, versioned, and governed.&lt;/li&gt;
&lt;li&gt;Allow for extensibility and plugin models.&lt;/li&gt;
&lt;li&gt;Support gradual adoption and migration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modular platforms adapt better to changing needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Documentation
&lt;/h2&gt;

&lt;p&gt;Documentation is as important as code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provide step-by-step guides, tutorials, and FAQs&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Keep documentation up to date and discoverable.&lt;/li&gt;
&lt;li&gt;Offer workshops and internal community support.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Knowledge-sharing accelerates adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Challenges in Platform Engineering
&lt;/h2&gt;

&lt;p&gt;While the principles provide a solid foundation, platform engineering comes with its own set of challenges. Here are some common pitfalls and how to mitigate them:&lt;/p&gt;

&lt;h3&gt;
  
  
  Over-Engineering
&lt;/h3&gt;

&lt;p&gt;It’s tempting to design a "perfect" platform with every feature imaginable, but this often leads to complexity and low adoption. Start small, deliver value early, and iterate based on user feedback.&lt;/p&gt;

&lt;h3&gt;
  
  
  Neglecting Developer Experience
&lt;/h3&gt;

&lt;p&gt;A platform that’s hard to use will be ignored. Ensure a clear focus on usability, documentation, and support, and involve developers early in the design process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security as an Afterthought
&lt;/h3&gt;

&lt;p&gt;Retrofitting security is expensive and risky. Integrate it from the start: automate checks, enforce least privilege, and audit all access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lack of Observability
&lt;/h3&gt;

&lt;p&gt;Without good logging, monitoring, and tracing, troubleshooting becomes a nightmare. Prioritize observability as a first-class citizen of the platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rigid Governance
&lt;/h3&gt;

&lt;p&gt;While standards are essential, being too rigid stifles innovation. Provide "golden paths" but allow for "escape hatches" when teams need flexibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ignoring Costs
&lt;/h3&gt;

&lt;p&gt;Platforms can become expensive, especially at scale. Regularly review usage, optimize resource allocation, and implement cost transparency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Underestimating Change Management
&lt;/h3&gt;

&lt;p&gt;Introducing a platform often means changing how teams work. Invest in onboarding, training, and support to drive adoption and reduce resistance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus Points
&lt;/h2&gt;

&lt;p&gt;A successful platform isn’t just about technology; it’s about empowering teams, reducing friction, ensuring security, and enabling innovation.&lt;/p&gt;

&lt;p&gt;Also, don't follow trends blindly. See what best fits your developers' needs and supports effective software delivery. For example, if you are a small platform team (2-3 members) serving a handful of development teams, focus on solving real problems by understanding their needs, rather than adopting a trending technology like Kubernetes without weighing its operational overhead, complexity, and steep learning curve.&lt;/p&gt;

&lt;p&gt;By embracing these principles and being mindful of common challenges, platform engineers can build systems that not only scale technically but also foster a culture of collaboration, ownership, and excellence.&lt;/p&gt;

&lt;p&gt;Whether you’re just starting your platform journey or scaling an existing one, these principles provide a solid foundation for sustainable success.&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>devex</category>
      <category>cloudnative</category>
      <category>platform</category>
    </item>
    <item>
      <title>AWS LAMBDA BEST PRACTICES</title>
      <dc:creator>Sunny Nazar</dc:creator>
      <pubDate>Fri, 31 Mar 2023 15:52:48 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-lambda-best-practices-4chn</link>
      <guid>https://dev.to/aws-builders/aws-lambda-best-practices-4chn</guid>
      <description>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best Practices&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Right language, Small functions and Trigger type&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Lambda Layers&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Optimize cold start times&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Environment Variables for Configuration&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Concurrency setting&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Use the right memory and CPU settings&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Secure your Lambda functions&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Dead Letter Queue (DLQ) and Retries for error handling&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Testing, Versioning and Aliases for Deployment&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Error Handling, Logging, Monitoring, Tracing&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;strong&gt;Documentation Links&lt;/strong&gt;&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Overview&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AWS Lambda is a serverless computing platform that allows you to run your code in response to events and only pay for the compute time consumed. With Lambda, you can build and deploy applications without worrying about the underlying infrastructure. However, like any other technology, there are best practices that you can follow to ensure that you get the most out of it. In this blog, we'll look at some AWS Lambda best practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Best Practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Right language, Small functions and Trigger type&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;AWS Lambda natively supports various programming languages, including Java, Go, PowerShell, Node.js, C#, Python, and Ruby. Lambda also provides a &lt;strong&gt;Runtime API&lt;/strong&gt;, which allows you to use additional programming languages to create your functions. When choosing a language for your function, consider your use case and the language's strengths.&lt;/p&gt;

&lt;p&gt;AWS Lambda is designed to run &lt;em&gt;small, focused functions&lt;/em&gt;. When building your functions, try to keep them as small as possible and focused on a single task. This makes it easier to test, deploy, and maintain your code. Note that a single Lambda invocation can run for &lt;em&gt;at most 15 minutes&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;AWS Lambda supports different trigger types, such as API Gateway, S3, and CloudWatch Events. Choose the right trigger type for your function based on your use case and expected workload. If you follow an event-driven architecture, that should already help you choose the right trigger type.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Lambda Layers&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;If you have code that is shared across multiple functions, please consider using AWS Lambda Layers to manage it. A layer is a ZIP archive that contains libraries, custom runtimes, or other function code. You can use layers to manage dependencies, reduce the size of your function deployment packages, and simplify your code maintenance.&lt;/p&gt;
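
&lt;p&gt;As a minimal Terraform sketch of the idea above, a layer can be packaged once and attached to a function. The resource names, ZIP paths, runtime, and the &lt;code&gt;lambda_role_arn&lt;/code&gt; variable are illustrative assumptions, not part of any existing setup.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Assumption: an execution role already exists and its ARN is passed in.
variable "lambda_role_arn" {
  type        = string
  description = "ARN of an existing execution role for the function"
}

# Package the shared libraries once as a layer (hypothetical ZIP path).
resource "aws_lambda_layer_version" "shared_deps" {
  layer_name          = "shared-deps"
  filename            = "build/shared_deps_layer.zip"
  compatible_runtimes = ["python3.9"]
}

# Attach the layer so the function package only contains its own code.
resource "aws_lambda_function" "orders" {
  function_name = "orders-handler"
  role          = var.lambda_role_arn
  runtime       = "python3.9"
  handler       = "app.handler"
  filename      = "build/orders.zip"
  layers        = [aws_lambda_layer_version.shared_deps.arn]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the function references the versioned layer ARN, publishing a new layer version and re-applying updates the function to that version.&lt;/p&gt;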

&lt;h3&gt;
  
  
  &lt;em&gt;Optimize cold start times&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Cold start times can impact the performance of your Lambda functions, especially for infrequently used functions. Optimize your code and use the right runtime to reduce cold start times. Some tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce the size of your deployment package.&lt;/li&gt;
&lt;li&gt;Use a language that has faster startup time.&lt;/li&gt;
&lt;li&gt;Use provisioned concurrency.&lt;/li&gt;
&lt;li&gt;Optimize resource allocation.&lt;/li&gt;
&lt;/ul&gt;
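
&lt;p&gt;Provisioned concurrency from the list above could be configured in Terraform roughly as follows. This is a sketch that assumes the function and a published alias already exist; the variable names and the value of 5 are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Assumptions: both values refer to resources that already exist.
variable "function_name" {
  type        = string
  description = "Name of an existing Lambda function"
}

variable "alias_name" {
  type        = string
  description = "Published alias (or version number) to keep warm"
}

# Keep a fixed number of execution environments initialized.
# The qualifier must be an alias or a published version, not $LATEST.
resource "aws_lambda_provisioned_concurrency_config" "warm" {
  function_name                     = var.function_name
  qualifier                         = var.alias_name
  provisioned_concurrent_executions = 5
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Provisioned concurrency is billed while it is configured, so it is usually reserved for latency-sensitive functions.&lt;/p&gt;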

&lt;h3&gt;
  
  
  &lt;em&gt;Environment Variables for Configuration&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;When building your functions, you may need to supply configuration such as API endpoints, feature flags, or database connection details. Use environment variables to store configuration settings instead of hard-coding them in your function's code. This makes it easier to manage your configuration and update it as needed. For sensitive values such as API keys and database credentials, best practice is to use SSM Parameter Store or Secrets Manager.&lt;/p&gt;
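
&lt;p&gt;One way to combine the two approaches, sketched in Terraform: plain settings go into environment variables, while the secret lives in SSM Parameter Store and only its name is passed to the function. All names, paths, and variables here are illustrative assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Assumptions: an existing execution role and a secret supplied at apply time.
variable "lambda_role_arn" {
  type = string
}

variable "db_password" {
  type      = string
  sensitive = true
}

# Store the secret encrypted in SSM Parameter Store.
resource "aws_ssm_parameter" "db_password" {
  name  = "/orders/db_password"
  type  = "SecureString"
  value = var.db_password
}

resource "aws_lambda_function" "orders" {
  function_name = "orders-handler"
  role          = var.lambda_role_arn
  runtime       = "python3.9"
  handler       = "app.handler"
  filename      = "build/orders.zip"

  environment {
    variables = {
      LOG_LEVEL = "INFO"
      # Pass the parameter name, not the secret itself.
      DB_PASSWORD_PARAM = aws_ssm_parameter.db_password.name
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this layout the function would read the parameter at runtime, so its execution role needs permission to call &lt;code&gt;ssm:GetParameter&lt;/code&gt; on that parameter.&lt;/p&gt;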

&lt;h3&gt;
  
  
  &lt;em&gt;Concurrency setting&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Configure your Lambda function with the right concurrency settings to handle incoming requests. The tips below will help you choose them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Understand your application's requirements&lt;/em&gt;: The first step in setting concurrency is to understand your application's requirements. Determine how many requests per second your application needs to handle, and set the concurrency limit accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Rely on automatic scaling&lt;/em&gt;: AWS Lambda automatically scales the number of concurrent executions based on the number of requests coming in, up to your account's concurrency limit. This built-in scaling lets your functions handle bursts of traffic without being overwhelmed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Reserve concurrency&lt;/em&gt;: The default limit on concurrent executions per AWS account in a region is 1000. This means that by default, up to 1000 requests can be processed simultaneously across all Lambda functions in that region. Reserving concurrency allows you to ensure that a certain number of executions are always available to a function, even when other functions are using up the concurrency pool. This can be useful for functions that need to respond quickly to requests, such as real-time applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Monitor and adjust&lt;/em&gt;: It's important to monitor the concurrency usage of your functions and adjust the concurrency limit accordingly. If you're consistently hitting the concurrency limit, consider increasing it. Conversely, if you're consistently underutilizing provisioned concurrency, consider reducing it to save costs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
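
&lt;p&gt;Reserved concurrency from the tips above can be set directly on the function. A minimal Terraform sketch, where the function details, the &lt;code&gt;lambda_role_arn&lt;/code&gt; variable, and the value of 50 are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Assumption: an execution role already exists.
variable "lambda_role_arn" {
  type = string
}

resource "aws_lambda_function" "checkout" {
  function_name = "checkout-handler"
  role          = var.lambda_role_arn
  runtime       = "python3.9"
  handler       = "app.handler"
  filename      = "build/checkout.zip"

  # Guarantee up to 50 concurrent executions for this function
  # and cap it at the same number.
  reserved_concurrent_executions = 50
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The reserved amount is subtracted from the pool shared by the other functions in the account and region, and it also acts as an upper bound for this function.&lt;/p&gt;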

&lt;h3&gt;
  
  
  &lt;em&gt;Use the right memory and CPU settings&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Configure your Lambda function with the right amount of memory to ensure optimal performance. You set memory directly, and Lambda allocates CPU in proportion to it. The right value depends on your workload, so be sure to test your functions under different load conditions and scenarios. Best practice is to start with the minimum memory your function requires and adjust the setting as you test.&lt;/p&gt;
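
&lt;p&gt;In Terraform, memory and timeout are plain arguments on the function, which makes them easy to adjust between test runs. The values and names below are illustrative starting points, not recommendations.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Assumption: an execution role already exists.
variable "lambda_role_arn" {
  type = string
}

resource "aws_lambda_function" "thumbnailer" {
  function_name = "thumbnailer"
  role          = var.lambda_role_arn
  runtime       = "python3.9"
  handler       = "app.handler"
  filename      = "build/thumbnailer.zip"

  memory_size = 256 # in MB; CPU share scales with this value
  timeout     = 30  # in seconds; the maximum is 900 (15 minutes)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;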

&lt;h3&gt;
  
  
  &lt;em&gt;Secure your Lambda functions&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Use AWS Identity and Access Management (IAM) to restrict who can invoke and manage your Lambda functions, and use encryption to protect your data at rest and in transit. Also make sure the IAM execution role attached to each Lambda function follows the principle of least privilege.&lt;/p&gt;
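
&lt;p&gt;A least-privilege execution role could look like the following Terraform sketch. It assumes a function that only reads from one DynamoDB table; the role name, policy name, actions, and the &lt;code&gt;orders_table_arn&lt;/code&gt; variable are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Assumption: the function only needs read access to this one table.
variable "orders_table_arn" {
  type = string
}

# Role that only the Lambda service can assume.
resource "aws_iam_role" "lambda_exec" {
  name = "orders-lambda-exec"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

# Grant specific actions on a specific resource, no wildcards.
resource "aws_iam_role_policy" "read_orders_table" {
  name = "read-orders-table"
  role = aws_iam_role.lambda_exec.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["dynamodb:GetItem", "dynamodb:Query"]
      Resource = var.orders_table_arn
    }]
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A real function would also need permission to write its logs to CloudWatch Logs, which is left out here to keep the sketch short.&lt;/p&gt;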

&lt;h3&gt;
  
  
  &lt;em&gt;Dead Letter Queue (DLQ) and Retries for error handling&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Retries in Lambda functions refer to the number of times AWS Lambda will automatically retry a function invocation in case of a function error. For asynchronous invocations, AWS Lambda by default retries a failed invocation twice, with a delay between retries.&lt;br&gt;
A DLQ is a queue where AWS Lambda can send events that still fail after all retries, so they can be analyzed or reprocessed later. Configure a DLQ for your Lambda function to handle errors more effectively and prevent data loss. This is particularly useful when using asynchronous event sources like SNS or S3.&lt;/p&gt;
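
&lt;p&gt;The retry and DLQ settings described above might be wired up in Terraform as sketched below. The queue name, function details, and the &lt;code&gt;lambda_role_arn&lt;/code&gt; variable are assumptions for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Assumption: the role behind this ARN may call sqs:SendMessage on the DLQ.
variable "lambda_role_arn" {
  type = string
}

# Queue that collects events which failed all retries.
resource "aws_sqs_queue" "orders_dlq" {
  name                      = "orders-dlq"
  message_retention_seconds = 1209600 # 14 days
}

resource "aws_lambda_function" "orders" {
  function_name = "orders-handler"
  role          = var.lambda_role_arn
  runtime       = "python3.9"
  handler       = "app.handler"
  filename      = "build/orders.zip"

  dead_letter_config {
    target_arn = aws_sqs_queue.orders_dlq.arn
  }
}

# Retry behaviour for asynchronous invocations.
resource "aws_lambda_function_event_invoke_config" "orders" {
  function_name                = aws_lambda_function.orders.function_name
  maximum_retry_attempts       = 2    # allowed range is 0 to 2
  maximum_event_age_in_seconds = 3600 # drop events older than one hour
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;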

&lt;h3&gt;
  
  
  &lt;em&gt;Testing, Versioning and Aliases for Deployment&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Test your Lambda functions thoroughly before deploying them to production. Use a combination of unit tests, integration tests, and end-to-end tests to ensure that your functions are working as expected.&lt;/p&gt;

&lt;p&gt;When deploying your functions, use versioning and aliases to manage your code. Versioning allows you to create and manage multiple immutable versions of your function code, while an alias is a named pointer to a specific version, giving callers a stable name to invoke. This makes it easier to manage deployments and rollbacks.&lt;/p&gt;
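
&lt;p&gt;A minimal Terraform sketch of versioning with an alias, assuming a hypothetical function and a &lt;code&gt;lambda_role_arn&lt;/code&gt; variable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Assumption: an execution role already exists.
variable "lambda_role_arn" {
  type = string
}

resource "aws_lambda_function" "orders" {
  function_name = "orders-handler"
  role          = var.lambda_role_arn
  runtime       = "python3.9"
  handler       = "app.handler"
  filename      = "build/orders.zip"

  # Publish a new numbered version whenever code or configuration changes.
  publish = true
}

# Stable name that callers invoke; it points at one published version.
resource "aws_lambda_alias" "live" {
  name             = "live"
  function_name    = aws_lambda_function.orders.function_name
  function_version = aws_lambda_function.orders.version
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Rolling back then means pointing the alias at an earlier version number instead of redeploying code.&lt;/p&gt;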

&lt;p&gt;Use a deployment pipeline to automate the process of building, testing, and deploying your Lambda functions. This can help you release new features and updates more frequently and with less risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Error Handling, Logging, Monitoring, Tracing&lt;/em&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Powertools for AWS Lambda to simplify your code:&lt;br&gt;
&lt;a href="https://github.com/awslabs/aws-lambda-powertools-python" rel="noopener noreferrer"&gt;Powertools for AWS Lambda&lt;/a&gt; is a set of open-source utilities and libraries that help simplify your code and improve observability. It includes modules for logging, error handling, metrics, and tracing, and can help reduce the amount of boilerplate code you need to write.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitor Your Functions for Performance and Errors:&lt;br&gt;
AWS Lambda integrates with CloudWatch Metrics, which allows you to monitor your functions for performance and errors. Make sure that you track the metrics that matter for your workload, such as errors, duration, and throttles, and set up alarms to notify you of any issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use AWS X-Ray for tracing: Use AWS X-Ray to trace requests through your Lambda function and other AWS services. This can help you identify performance bottlenecks and troubleshoot issues more easily.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use Logging for Debugging:&lt;br&gt;
When developing your functions, use logging to help you debug issues. AWS Lambda integrates with CloudWatch Logs, which allows you to view and analyze your logs in near real time. Make sure that your logging is comprehensive and includes useful information, such as error messages and input parameters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
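
&lt;p&gt;As one example of the alarms mentioned above, the following Terraform sketch raises an alarm when a function reports any error within five minutes. The two variables and the threshold are assumptions; adjust them to your own function and notification topic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Assumptions: an existing function and an existing SNS topic for alerts.
variable "function_name" {
  type = string
}

variable "alarm_topic_arn" {
  type = string
}

# Alarm on the built-in Errors metric that Lambda publishes to CloudWatch.
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "${var.function_name}-errors"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.alarm_topic_arn]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;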

&lt;h2&gt;
  
  
  &lt;strong&gt;Documentation Links&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html" rel="noopener noreferrer"&gt;AWS LAMBDA BEST PRACTICES&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/architecture/best-practices-for-developing-on-aws-lambda/" rel="noopener noreferrer"&gt;AWS LAMBDA ARCHITECTURE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-monitoring.html" rel="noopener noreferrer"&gt;AWS LAMBDA MONITORING&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/lambda/faqs" rel="noopener noreferrer"&gt;AWS LAMBDA FAQS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AWS Lambda is a powerful tool for building and deploying serverless applications. By following these best practices, you can build robust and reliable serverless applications whose functions are scalable, secure, and easy to manage.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>lambda</category>
      <category>cloud</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Securely Access Your EC2 Instances with AWS Systems Manager SSM and VPC Endpoints</title>
      <dc:creator>Sunny Nazar</dc:creator>
      <pubDate>Wed, 29 Mar 2023 15:00:19 +0000</pubDate>
      <link>https://dev.to/aws-builders/securely-access-your-ec2-instances-with-aws-systems-manager-ssm-and-vpc-endpoints-1bli</link>
      <guid>https://dev.to/aws-builders/securely-access-your-ec2-instances-with-aws-systems-manager-ssm-and-vpc-endpoints-1bli</guid>
      <description>&lt;ul&gt;
&lt;li&gt;Overview&lt;/li&gt;
&lt;li&gt;
Background Knowledge

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;What is SSH-Less Login?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;What is AWS Systems Manager (SSM)?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;How to Use SSM for SSH-Less Login?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Terraform code&lt;/li&gt;

&lt;li&gt;Documentation Links&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;As more and more organizations adopt cloud computing, managing resources on cloud platforms like Amazon Web Services (AWS) becomes increasingly important. The need to manage multiple Amazon Elastic Compute Cloud (EC2) instances effectively has led to the development of various tools to simplify the process. One such tool is &lt;strong&gt;AWS Systems Manager (SSM)&lt;/strong&gt;, which enables users to manage EC2 instances, as well as other AWS resources, using a single interface. One of the most powerful features of SSM is the ability to perform &lt;strong&gt;SSH-less login&lt;/strong&gt; to EC2 machines, which we will explore in this blog.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background Knowledge
&lt;/h2&gt;



&lt;h3&gt;
  
  
  &lt;em&gt;What is SSH-Less Login?&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Traditionally, logging into an EC2 instance involves connecting via SSH with a username and password or a key pair. However, managing SSH keys can be challenging, particularly when dealing with multiple EC2 instances. SSH-Less login, on the other hand, is a secure and more efficient method of accessing EC2 instances without requiring SSH keys.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;What is AWS Systems Manager (SSM)?&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;AWS Systems Manager (SSM) is a management service that enables users to automate the management of their EC2 instances and other AWS resources. SSM enables users to perform various tasks, including software installation, patching, and maintenance across a fleet of EC2 instances. It also provides a single interface to manage EC2 instances running in different regions and accounts.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;How to Use SSM for SSH-Less Login?&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;To use SSM for SSH-less login, follow the steps below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Group for EC2 Instance&lt;/strong&gt;: The minimum traffic you need to allow for SSM access to work is outbound HTTPS (port 443), so add a matching outbound rule to the security group of the EC2 instance.&lt;/p&gt;

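&lt;p&gt;&lt;strong&gt;Create an IAM Role&lt;/strong&gt;: To use SSM to log in to EC2 instances, you must first create an IAM role with the required permissions and attach it to the instances. The role must have the AmazonSSMManagedInstanceCore managed policy attached to it (the older AmazonEC2RoleforSSM policy is deprecated), which allows the SSM agent on the instance to communicate with the Systems Manager service.&lt;/p&gt;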

&lt;p&gt;&lt;strong&gt;Install SSM Agent&lt;/strong&gt;: After creating the IAM role, make sure the SSM agent is installed on each EC2 instance you want to access using SSM. The SSM agent is pre-installed on Amazon Linux 2, Amazon Linux, and several other AWS-provided AMIs; on instances launched from AMIs without it, you must install it manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configure EC2 Instances&lt;/strong&gt;: Once the SSM agent is installed, you need to configure your EC2 instances to allow SSM access. You can do this by creating VPC endpoints for SSM. The VPC endpoints required when using private subnets are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;com.amazonaws.region.ec2messages&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;com.amazonaws.region.ssmmessages&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;com.amazonaws.region.ssm&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;com.amazonaws.region.kms&lt;/em&gt; (This is needed if you want to use AWS KMS encryption for Session Manager.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The security group for VPC Endpoints must allow inbound HTTPS (port 443) traffic from the resources in your VPC that communicate with the service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraform code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Let's start by creating the VPC, Public Subnet, Private Subnet, Internet Gateway, NAT Gateway and Route tables&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prerequisite: create the provider configuration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;required_providers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;aws&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hashicorp/aws"&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 4.60.0"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"aws"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The variable can be defined like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Please set variable region as per your needs.&lt;/span&gt;
&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"region"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Region for the resource deployment"&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"eu-central-1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a VPC&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc"&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-${var.region}"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create an internet gateway&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_internet_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"gw"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"igw-${var.region}"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create a public subnet&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_subnet"&lt;/span&gt; &lt;span class="s2"&gt;"public_subnet"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.1.0/24"&lt;/span&gt;
  &lt;span class="nx"&gt;availability_zone&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.region}a"&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Public Subnet"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create a private subnet&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_subnet"&lt;/span&gt; &lt;span class="s2"&gt;"private_subnet"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.2.0/24"&lt;/span&gt;
  &lt;span class="nx"&gt;availability_zone&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.region}a"&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Private Subnet"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create a NAT gateway&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_nat_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"nat_gateway"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;allocation_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_eip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nat_eip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ngw-${var.region}"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create an EIP for the NAT gateway&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_eip"&lt;/span&gt; &lt;span class="s2"&gt;"nat_eip"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create a public route table and associate it with the public subnet&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table"&lt;/span&gt; &lt;span class="s2"&gt;"public_route_table"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;route&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
    &lt;span class="nx"&gt;gateway_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_internet_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Public route table"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table_association"&lt;/span&gt; &lt;span class="s2"&gt;"public_route_table_association"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create a private route table and associate it with the private subnet&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table"&lt;/span&gt; &lt;span class="s2"&gt;"private_route_table"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;route&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_block&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
    &lt;span class="nx"&gt;nat_gateway_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_nat_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nat_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Private route table"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table_association"&lt;/span&gt; &lt;span class="s2"&gt;"private_route_table_association"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Next, create the security groups for the EC2 instance and the VPC endpoints.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a security group for the EC2 instance&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"instance_security_group"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name_prefix&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"instance-sg"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"security group for the EC2 instance"&lt;/span&gt;

  &lt;span class="c1"&gt;# Allow outbound HTTPS traffic&lt;/span&gt;
  &lt;span class="nx"&gt;egress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow HTTPS outbound traffic"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"EC2 Instance security group"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Security group for VPC Endpoints&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"vpc_endpoint_security_group"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name_prefix&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vpc-endpoint-sg"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"security group for VPC Endpoints"&lt;/span&gt;

  &lt;span class="c1"&gt;# Allow inbound HTTPS traffic&lt;/span&gt;
  &lt;span class="nx"&gt;ingress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cidr_block&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow HTTPS traffic from VPC"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"VPC Endpoint security group"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Now we can create the VPC endpoints that Systems Manager needs: ssm, ssmmessages, and ec2messages.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;endpoints&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"endpoint-ssm"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ssm"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="s2"&gt;"endpoint-ssm-messages"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ssmmessages"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="s2"&gt;"endpoint-ec2-messages"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ec2messages"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

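&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"endpoints"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;endpoints&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_endpoint_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Interface"&lt;/span&gt;
  &lt;span class="nx"&gt;service_name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"com.amazonaws.${var.region}.${each.value.name}"&lt;/span&gt;
  &lt;span class="c1"&gt;# Place the endpoint network interfaces in the private subnet&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="c1"&gt;# Resolve the regional service hostnames to the endpoints (the VPC must have DNS support and DNS hostnames enabled)&lt;/span&gt;
  &lt;span class="nx"&gt;private_dns_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="c1"&gt;# Add a security group to the VPC endpoint&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_endpoint_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;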

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;With the endpoints in place, the final components are the IAM instance profile and the EC2 instance.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create IAM role for EC2 instance&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"ec2_role"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"EC2_SSM_Role"&lt;/span&gt;

  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Effect&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
        &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;Service&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ec2.amazonaws.com"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Attach AmazonSSMManagedInstanceCore policy to the IAM role&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"ec2_role_policy"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;policy_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ec2_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create an instance profile for the EC2 instance and associate the IAM role&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_instance_profile"&lt;/span&gt; &lt;span class="s2"&gt;"ec2_instance_profile"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"EC2_SSM_Instance_Profile"&lt;/span&gt;

  &lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ec2_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_ami"&lt;/span&gt; &lt;span class="s2"&gt;"amazon_linux_2_ssm"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;most_recent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="c1"&gt;# Restrict the search to Amazon-owned images&lt;/span&gt;
  &lt;span class="nx"&gt;owners&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"amazon"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"name"&lt;/span&gt;
    &lt;span class="nx"&gt;values&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"amzn2-ami-hvm-*-x86_64-ebs"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create EC2 instance&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"ec2_instance"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_ami&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;amazon_linux_2_ssm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t2.micro"&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_security_group_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;instance_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;iam_instance_profile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_instance_profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ec2_instance_profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Access EC2 Instance using SSM&lt;/strong&gt;: After applying the configuration above, you can open a shell on the EC2 instance through SSM without an SSH key or an open inbound port. In the EC2 console, select the instance, click "Connect", and choose the "Session Manager" option. This opens a browser-based shell on the instance.&lt;/p&gt;
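
&lt;p&gt;If you prefer the command line, the same session can be opened with the AWS CLI. The outputs below are a minimal sketch and not part of the original configuration: the output names &lt;code&gt;instance_id&lt;/code&gt; and &lt;code&gt;ssm_start_session_command&lt;/code&gt; are hypothetical, and running the printed command assumes the AWS CLI and the Session Manager plugin are installed on your machine.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Hypothetical outputs (names are illustrative, not from the original setup)
output "instance_id" {
  description = "ID of the EC2 instance in the private subnet"
  value       = aws_instance.ec2_instance.id
}

# Builds the AWS CLI command that opens a Session Manager session
output "ssm_start_session_command" {
  description = "Command to open a Session Manager session from a terminal"
  value       = "aws ssm start-session --target ${aws_instance.ec2_instance.id} --region ${var.region}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After &lt;code&gt;terraform apply&lt;/code&gt;, running &lt;code&gt;terraform output -raw ssm_start_session_command&lt;/code&gt; should print the command to run.&lt;/p&gt;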

&lt;h2&gt;
  
  
  Documentation Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://repost.aws/knowledge-center/ec2-systems-manager-vpc-endpoints" rel="noopener noreferrer"&gt;Systems Manager to manage private EC2 instances&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/systems-manager/" rel="noopener noreferrer"&gt;AWS Systems Manager&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Using SSM for SSH-less login gives you a secure way to manage EC2 instances without distributing or rotating SSH keys. Systems Manager also lets you handle software installation, patching, and maintenance across a fleet of instances from a single interface. With the steps above, you can set up SSH-less access to EC2 instances running in a private subnet.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ec2</category>
      <category>terraform</category>
      <category>awscommunitybuilders</category>
    </item>
  </channel>
</rss>
