DEV Community: Sunny Nazar

The Complete Guide to Prometheus Metric Types

Sunny Nazar — Sun, 11 Jan 2026 20:42:05 +0000

The Complete Guide to Prometheus Metric Types: PromQL, Alerting and Troubleshooting

Reading Time: 15 minutes

The 3 AM Call
Quick Reference Card
Which Metric Type Should I Use
Meet the Four Metric Types
- Counter: The Tireless Bookkeeper
- Gauge: The Live Reporter
- Histogram: The Distribution Detective
- Summary: The Solo Performer
Comparison Matrix
PromQL Functions by Metric Type
Alerting Strategies
Troubleshooting Quick Reference
The Cardinality Monster
Best Practices
Conclusion

The 3 AM Call

It's 3:17 AM. Your phone buzzes violently on the nightstand.

You grab it with one eye open. PagerDuty. Of course.

"CRITICAL: API latency exceeds threshold"

You stumble to your laptop, coffee-less and bleary-eyed. Grafana loads. The dashboard is a mess of red lines spiking upward. Your mind races: Is this a traffic spike? A memory leak? Did someone deploy something?

You stare at the metrics. http_requests_total is climbing. process_resident_memory_bytes looks normal. But wait... what does that histogram actually mean? Why is the p99 showing NaN? And why on earth did someone create a metric with user_id as a label?

Sound familiar?

This guide exists because I've been there. We've all been there. And the truth is, most Prometheus pain comes down to one thing: not fully understanding the four metric types.

Let me introduce you to them. Think of them as four tools in your observability toolkit. Each has a job. Each has rules. Use the wrong one, and you'll be back at 3 AM wondering why your alerts are lying to you.

Let's fix that.

Quick Reference Card

Need a quick answer? Start here.

Metric Type	Best For	Key Function	Suffix	Can Aggregate?
Counter	Totals (requests, errors, bytes)	`rate()`	`_total`	✅ Yes
Gauge	Current state (memory, CPU)	Raw value	None	✅ Yes
Histogram	Latency distributions	`histogram_quantile()`	`_seconds`	✅ Yes
Summary	Per-instance percentiles	Direct read	`_seconds`	⚠️ Only sum/count

The Essential Queries You'll Use Every Day

# Counter: "How many requests per second are we getting?"
rate(http_requests_total[5m])

# Gauge: "How much memory are we using right now?"
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Histogram: "What's our p99 latency across all pods?"
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Summary: "What's the average latency?" (works across instances, unlike quantiles)
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

Which Metric Type Should I Use

Before diving into the details, let me save you some time. Here's the decision guide that I wish someone had shown me years ago:

The Quick Decision Table

If you would say...	Use
"How many X happened?"	Counter
"What is the current X?"	Gauge
"What's the p99 latency across all pods?"	Histogram
"What's the p99 on this specific pod?"	Summary

Now let me tell you the stories behind each of these tools.

Meet the Four Metric Types

Counter: The Tireless Bookkeeper

Picture a diligent accountant who sits at the entrance of your application. Every time a request comes in, she makes a tally mark. Every error? Another tally. Bytes transferred? She counts them all.

The Counter never forgets. She never erases. Her numbers only go up. The only time they reset is when she goes home for the night (your process restarts).

The Counter's Personality

A Counter is a cumulative metric that only increases. Think of it as an odometer in your car. The number only goes up. You don't care about the current number per se; you care about how fast it's changing.

This is the crucial insight: raw counter values are almost useless. What you want is the rate.

When to Use a Counter

Counters thrive when tracking:

Total HTTP requests received
Bytes sent over the network
Errors encountered
Background jobs completed
Messages processed from a queue

Counter Characteristics

Property	Value
Direction	Only goes up (monotonically increasing)
Reset Behavior	Resets to 0 when the process restarts
Typical Suffix	`_total`
Raw Value Usefulness	Low (always use `rate()` or `increase()`)
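
To make the reset behavior concrete, here is a minimal Python sketch of the idea behind `rate()` and `increase()`: any drop in a counter is treated as a restart from zero. The function names and scrape values are hypothetical, and the real functions also extrapolate to the edges of the window, which this sketch skips.

```python
from typing import Sequence


def increase_with_resets(samples: Sequence[float]) -> float:
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter went backwards: the process restarted and began
            # again from zero, so the whole current value is new increase.
            total += curr
    return total


def per_second_rate(samples: Sequence[float], window_seconds: float) -> float:
    """Average per-second rate over the window (no extrapolation)."""
    return increase_with_resets(samples) / window_seconds


# Hypothetical scrapes, 15 seconds apart; the process restarts after 130.
scrapes = [100, 115, 130, 5, 20]
print(increase_with_resets(scrapes))   # 15 + 15 + 5 + 15
print(per_second_rate(scrapes, 60))
```

This is why a graph of the raw counter shows a cliff at every deploy while the `rate()` graph stays smooth.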

Talking to the Counter: PromQL Patterns

# The WRONG way: Raw value tells you nothing useful
http_requests_total

# The RIGHT way: Rate of requests per second over 5 minutes
rate(http_requests_total[5m])

# Filter by label (e.g., only 500 errors)
rate(http_requests_total{status="500"}[5m])

# Total increase over the last hour
increase(http_requests_total[1h])

# Sum rates across all instances
sum(rate(http_requests_total[5m]))

# Group by HTTP method
sum by (method) (rate(http_requests_total[5m]))

# The money query: Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ 
sum(rate(http_requests_total[5m])) * 100

Counter Alerts That Actually Work

# "Our error rate is too high"
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) 
    / 
    sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate exceeds 5%"

# "Traffic dropped suddenly - possible outage"
- alert: TrafficDrop
  expr: |
    sum(rate(http_requests_total[5m])) 
    < 
    sum(rate(http_requests_total[5m] offset 1h)) * 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Traffic dropped by more than 50% compared to 1 hour ago"

# "We're getting zero requests - something is very wrong"
- alert: NoTraffic
  expr: sum(rate(http_requests_total[5m])) == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No HTTP requests received in the last 5 minutes"

Gauge: The Live Reporter

If the Counter is an accountant tallying historical records, the Gauge is a live news reporter telling you what's happening right now.

"Memory usage is at 78%!" she reports. A moment later: "It dropped to 72%!" Unlike the Counter, the Gauge's numbers go up and down. She reflects the current state of the world.

The Gauge's Personality

A Gauge represents a single numerical value that can arbitrarily go up and down. It's a snapshot of reality at any moment. Think of a thermometer, a fuel gauge, or your current queue depth.

The beautiful thing about gauges? The raw value is immediately meaningful. When someone asks "How much memory are we using?", the gauge has the answer.

When to Use a Gauge

Gauges excel at:

Current memory or CPU usage
Number of active connections
Queue depth
Temperature readings
Number of goroutines running
Disk space remaining

Gauge Characteristics

Property	Value
Direction	Can increase or decrease
Reset Behavior	Not applicable (always reflects current state)
Typical Suffix	None specific
Raw Value Usefulness	High (the current value is what you want)

Talking to the Gauge: PromQL Patterns

# Direct reading - totally valid and useful
node_memory_MemAvailable_bytes

# Calculate percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Average, min, max over time
avg_over_time(node_load1[1h])
max_over_time(node_load1[1h])
min_over_time(node_load1[1h])

# Predict the future: "When will we run out of disk?"
predict_linear(node_filesystem_avail_bytes[6h], 3600 * 24)

# Rate of change (unusual for gauges, but useful for capacity planning)
deriv(node_memory_MemAvailable_bytes[5m])

# Find the top consumers
topk(5, node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
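
If `predict_linear()` feels like magic, the idea behind it is a least-squares line through the samples, extended into the future. Here is a rough Python sketch under simplifying assumptions: the sample data is made up, and this version extrapolates from the last sample rather than from the query evaluation time.

```python
from typing import Sequence, Tuple


def predict_linear(samples: Sequence[Tuple[float, float]], seconds_ahead: float) -> float:
    """Fit value = slope * t + intercept by least squares, then extrapolate
    seconds_ahead past the last sample's timestamp."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Hypothetical disk: 20 GB free, shrinking by 1 GB every hour.
free_bytes = [(0, 20e9), (3600, 19e9), (7200, 18e9)]
forecast = predict_linear(free_bytes, 24 * 3600)
print(forecast < 0)  # a negative forecast means "full within 24 hours"
```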

Gauge Alerts That Actually Work

# "Memory is running low"
- alert: HighMemoryUsage
  expr: |
    (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Memory usage above 90% on {{ $labels.instance }}"

# "Disk will fill up in 24 hours" - this is the kind of proactive alert that makes SREs heroes
- alert: DiskFillingUp
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Disk {{ $labels.mountpoint }} will fill within 24 hours"

# "Connection pool is almost exhausted"
- alert: ConnectionPoolNearExhaustion
  expr: db_pool_active_connections / db_pool_max_connections > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Connection pool is 80% utilized"

Histogram: The Distribution Detective

Now we get to the interesting ones. The Histogram is a detective who doesn't just count crimes; she categorizes them by severity and gives you the full picture.

"Out of 1000 requests," she reports, "150 completed in under 100ms, 700 completed in under 500ms, and 950 completed in under 1 second. The remaining 50 took longer."

This is the power of the Histogram. It doesn't just tell you the average. It shows you the distribution.

When to Use a Histogram

Histograms are perfect for:

Request latency (how long did API calls take?)
Response sizes
Any measurement where you need percentiles
When you need to aggregate percentiles across multiple pods (this is the killer feature)

Histogram Characteristics

Property	Value
Components	Three kinds of series: `_bucket` (one per bucket), `_sum`, `_count`
Aggregation	Fully aggregatable across instances (this is huge!)
Configuration	Bucket boundaries must be defined upfront
Typical Suffix	`_seconds`, `_bytes`

The Histogram's Secret: Buckets

Here's what a histogram actually creates behind the scenes:

http_request_duration_seconds_bucket{le="0.1"}   --> 150 requests were <= 100ms
http_request_duration_seconds_bucket{le="0.5"}   --> 700 requests were <= 500ms
http_request_duration_seconds_bucket{le="1"}     --> 950 requests were <= 1s
http_request_duration_seconds_bucket{le="+Inf"}  --> 1000 requests total
http_request_duration_seconds_sum                --> Total time spent (e.g., 423.7 seconds)
http_request_duration_seconds_count              --> Total count (1000)

The le label means "less than or equal to." Buckets are cumulative.
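
Because buckets are cumulative, a quantile can be estimated by finding the bucket where the target rank lands and interpolating inside it. The Python sketch below illustrates that approach with the bucket counts from the example above. It is a simplified model, not Prometheus's actual implementation, and it also shows where a NaN comes from when there are no observations.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations: nothing to estimate
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank lands past the last finite bucket (or in an empty
                # one): the best available answer is the previous bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


buckets = [(0.1, 150), (0.5, 700), (1.0, 950), (math.inf, 1000)]
print(histogram_quantile(0.5, buckets))   # falls inside the 0.1-0.5 bucket
print(histogram_quantile(0.99, buckets))  # falls in +Inf, so capped at 1.0
```

Notice that the p99 here can never exceed 1 second, even though 50 requests were slower. That is the practical reason bucket boundaries need to cover the latencies you care about.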

Talking to the Histogram: PromQL Patterns

# Calculate the 50th percentile (median)
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# Calculate p99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# P99 latency per endpoint (aggregated correctly!)
histogram_quantile(0.99, 
  sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
)

# Average request duration (simpler alternative)
rate(http_request_duration_seconds_sum[5m]) 
/ 
rate(http_request_duration_seconds_count[5m])

# "What percentage of requests complete in under 500ms?" (Apdex-style)
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m])) * 100

Histogram Alerts That Actually Work

# "P99 latency is too high"
- alert: HighP99Latency
  expr: |
    histogram_quantile(0.99, 
      sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "P99 latency exceeds 2 seconds for {{ $labels.service }}"

# "Latency doubled compared to an hour ago"
- alert: LatencyDegradation
  expr: |
    histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
    >
    histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m] offset 1h))) * 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency is 2x higher than 1 hour ago"

# SLO violation: "Less than 99% of requests are fast"
- alert: SLOViolation
  expr: |
    sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30m]))
    /
    sum(rate(http_request_duration_seconds_count[30m])) < 0.99
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "SLO Violation: Less than 99% of requests complete within 500ms"

Summary: The Solo Performer

The Summary is the Histogram's cousin. She can also give you percentiles, but with one crucial difference: she calculates them herself, on the client side.

This makes her fast and precise for a single instance. But here's the catch: she can't collaborate. If you have 10 pods running, you cannot simply combine their percentiles to get a global percentile. Averaging p99s does not give you the true p99. It's mathematically wrong.
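
A tiny experiment makes the problem visible. The Python sketch below uses made-up traffic for two pods and a simple nearest-rank percentile (an assumption for illustration; real Summary clients use streaming estimators).

```python
import math
from typing import List


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(q * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


# Hypothetical traffic: a busy, fast pod and a quiet, slow pod.
pod_a = [0.1] * 900   # 900 requests at 100 ms
pod_b = [2.0] * 100   # 100 requests at 2 s

avg_of_p99s = (percentile(pod_a, 0.99) + percentile(pod_b, 0.99)) / 2
true_p99 = percentile(pod_a + pod_b, 0.99)

print(avg_of_p99s)  # about 1.05 s
print(true_p99)     # 2.0 s
```

The averaged figure understates the real p99 by almost half, because it ignores how many requests each pod actually served.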

⚠️ The Summary Trap: I've seen teams spend hours debugging "wrong" percentiles, only to discover they were accidentally averaging Summary quantiles across instances. Don't be that team. If you need to aggregate, use Histograms.

When to Use a Summary

Summaries are appropriate when:

You genuinely only care about a single instance
You don't know bucket boundaries ahead of time
You're maintaining legacy code (most new projects should use Histograms)

Summary Characteristics

Property	Value
Components	Pre-calculated quantiles, plus `_sum` and `_count`
Aggregation	Cannot aggregate quantiles (only sum/count)
Percentile Calculation	Done on the client side
Typical Suffix	`_seconds`, `_bytes`

Talking to the Summary: PromQL Patterns

# Read quantiles directly (only meaningful per-instance)
http_request_duration_seconds{quantile="0.99"}

# Average latency - this DOES work across instances!
sum(rate(http_request_duration_seconds_sum[5m])) 
/ 
sum(rate(http_request_duration_seconds_count[5m]))

# DON'T DO THIS - averaging quantiles is mathematically wrong
# avg(http_request_duration_seconds{quantile="0.99"})

# If you must look at quantiles, do it per-instance
http_request_duration_seconds{quantile="0.99", instance="pod-1:8080"}

Comparison Matrix

Feature	Counter	Gauge	Histogram	Summary
Direction	Only up ⬆️	Up and down ↕️	N/A	N/A
Raw value useful	❌ No	✅ Yes	❌ No	Partial
Use rate()	Required	Rare	On buckets	On sum/count
Aggregatable	✅ Yes	✅ Yes	✅ Yes	⚠️ Only sum/count
Percentiles	❌ No	❌ No	✅ Server-side	✅ Client-side
Storage cost	Low	Low	Higher	Medium

PromQL Functions by Metric Type

Function	Counter	Gauge	Histogram	Summary
`rate()`	✅ Primary	❌ No	✅ On buckets	✅ On sum/count
`irate()`	✅ Yes	❌ No	✅ Yes	✅ Yes
`increase()`	✅ Yes	❌ No	✅ Yes	✅ Yes
`deriv()`	❌ No	✅ Yes	❌ No	❌ No
`delta()`	❌ No	✅ Yes	❌ No	❌ No
`predict_linear()`	❌ No	✅ Yes	❌ No	❌ No
`histogram_quantile()`	❌ No	❌ No	✅ Required	❌ No

Alerting Strategies

The Golden Signals

Google's SRE book teaches us to monitor four things. Here's how metric types map to them:

# 1. LATENCY (Histogram) - "How long do things take?"
- alert: HighLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1

# 2. TRAFFIC (Counter) - "How much are we doing?"
- alert: TrafficAnomaly
  expr: |
    abs(sum(rate(http_requests_total[5m])) - sum(rate(http_requests_total[5m] offset 1w)))
    / sum(rate(http_requests_total[5m] offset 1w)) > 0.5

# 3. ERRORS (Counter) - "How often do things fail?"
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01

# 4. SATURATION (usually a Gauge; CPU here is a Counter turned into a ratio with rate()) - "How full is our system?"
- alert: HighSaturation
  expr: avg by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9

SLO-Based Multi-Burn Rate Alerts

For the more advanced: burn rate alerts that catch both fast and slow burns of your error budget.

# Fast burn: 2% of monthly error budget consumed in 1 hour
- alert: SLOFastBurn
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) > 14.4 * 0.001)
    and
    (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 14.4 * 0.001)
  labels:
    severity: critical

# Slow burn: 10% of monthly error budget consumed in 3 days
- alert: SLOSlowBurn
  expr: |
    (sum(rate(http_requests_total{status=~"5.."}[3d])) / sum(rate(http_requests_total[3d])) > 1 * 0.001)
    and
    (sum(rate(http_requests_total{status=~"5.."}[6h])) / sum(rate(http_requests_total[6h])) > 1 * 0.001)
  labels:
    severity: warning
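
The numbers in these rules come from simple arithmetic. Here is a short Python sketch, assuming a 99.9% SLO and a 30-day budget period (the function names are mine, not part of any library):

```python
def budget_consumed(burn_rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the error budget spent if burn_rate holds for window_hours."""
    return burn_rate * window_hours / period_hours


def error_ratio_threshold(burn_rate: float, slo: float) -> float:
    """Error ratio at which the given burn rate is reached for an SLO target."""
    return burn_rate * (1 - slo)


# Fast burn: a burn rate of 14.4 sustained for 1 hour.
print(budget_consumed(14.4, 1))            # about 0.02, i.e. 2% of the budget
print(error_ratio_threshold(14.4, 0.999))  # about 0.0144, the 14.4 * 0.001 in the rule
```

Plugging in a different burn rate and window shows how much budget an alert lets you lose before it fires.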

Troubleshooting Quick Reference

When things go wrong at 3 AM, use this table:

General Issues (All Metric Types)

Symptom	Likely Cause	Fix	Debug Query
No data at all	Target not scraped	Check target status	`up{job="my-service"}`
Gaps in graph	Scrape failures	Check scrape duration	`scrape_duration_seconds{job="..."}`
Too many series	High cardinality	Add label filters	`topk(10, count by (__name__)({__name__!=""}))`

Counter Issues

Symptom	Likely Cause	Fix
Flat line	No events occurring	Check application logic
Sudden drops	Counter reset	Use `rate()` (it handles resets)
Negative rate	`delta()` or `deriv()` applied to a counter (`rate()` never returns a negative value)	Use `rate()` or `increase()`

Gauge Issues

Symptom	Likely Cause	Fix
Value unchanged	Stale metric	Check scrape status
Noisy graph	High variance	Use `avg_over_time()`
Wrong scale	Unit mismatch	Check metric units

Histogram Issues

Symptom	Likely Cause	Fix
Wrong percentile	Bad bucket boundaries	Add more buckets
Most values in +Inf	Buckets too small	Increase upper bounds
NaN result	No samples	Increase time window

Summary Issues

Symptom	Likely Cause	Fix
Wrong global p99	Averaged quantiles	Switch to Histogram

The Cardinality Monster

Let me tell you about the monster that has brought down more Prometheus instances than any other: cardinality.

Cardinality is the number of unique time series in your system. And it can explode faster than you think.

How Cardinality Explodes

Every unique combination of labels creates a new time series:

1 metric × 5 methods × 10 status codes × 100 endpoints × 50 instances
= 250,000 time series from ONE metric
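
The multiplication is easy to play with in Python. The sketch below computes the worst-case series count for the label set above, then shows what happens when a hypothetical `user_id` label with a million values is added. Real counts are usually lower because not every combination occurs.

```python
import math
from typing import Dict


def series_count(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(label_cardinalities.values())


labels = {"method": 5, "status": 10, "endpoint": 100, "instance": 50}
print(series_count(labels))            # 250000

# Adding a single unbounded label multiplies everything that came before it.
labels_with_user = {**labels, "user_id": 1_000_000}
print(series_count(labels_with_user))  # 250000000000
```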

Labels That Will Destroy Your Prometheus

Never use these as labels:

Label Type	Example	Why It's Bad
User IDs	`user_id="12345"`	Millions of values
Request IDs	`request_id="abc-123"`	One per request
Timestamps	`timestamp="2024-01-01"`	Infinite growth
IP addresses	`client_ip="192.168.1.1"`	Thousands of values
Session tokens	`session="..."`	One per session
Error messages	`error="Connection refused..."`	Unbounded strings

Detecting the Monster

# How bad is it? Count all series.
count({__name__!=""})

# Find the offenders
topk(10, count by (__name__) ({__name__!=""}))

# Check per-label cardinality
count by (endpoint) (http_requests_total)

Cardinality Guidelines

Level	Series per Metric	Action
🟢 Low	Under 1,000	You're fine
🟡 Moderate	1K - 10K	Monitor it
🟠 High	10K - 100K	Investigate
🔴 Critical	Over 100K	Fix immediately

Best Practices

Do These Things

Always use rate() with counters - Raw values are useless
Set rate window to at least 4x the scrape interval - Keeps enough samples in the window even if a scrape is missed
Include le in your by clause before histogram_quantile()
Use histograms for percentiles - They aggregate correctly
Add for duration to alerts - Prevents flapping
Define bucket boundaries based on SLOs - Know what matters

Avoid These Mistakes

Averaging summary quantiles - Mathematically wrong
Using irate() for alerting - Too volatile
Alerting on raw gauge spikes - Use for duration
High cardinality labels - They'll kill your Prometheus
avg_over_time(rate(...)) - Just use a larger rate window

Conclusion

So here we are. It's 4:15 AM, but you're no longer panicking.

You know that the Counter is your reliable bookkeeper, always tallying but never forgetting. You query her with rate().

You know that the Gauge is your live reporter, giving you the current state. Her raw values make sense.

You know that the Histogram is your distribution detective, revealing the patterns in your latency. She aggregates correctly across all your pods.

And you know to be careful with the Summary, the solo performer who can't collaborate across instances.

Most importantly, you've learned to respect the Cardinality Monster and keep him caged.

The pager may buzz again. But next time, you'll know exactly what you're looking at.

Now go get some sleep. You've earned it.

Platform Engineering Principles

Sunny Nazar — Thu, 29 May 2025 14:41:06 +0000

In today’s fast-paced, cloud-native world, Platform Engineering has emerged as a critical discipline for delivering self-service, scalable, reliable, secure, cost-optimized, and efficient software delivery platforms. Whether you’re building internal developer platforms, shared infrastructure, or enabling DevOps practices, a well-designed platform has become the backbone of modern engineering organizations.

But what exactly makes a platform successful? What are the core principles to keep in mind when building such platforms? And what are the common pitfalls to avoid?

Let’s dive deep into the fundamental principles that underpin effective and sustainable platform engineering.

Core Principles of Platform Engineering

Developer Experience (DX)
Security and Compliance
Multi-Tenancy and Isolation
Observability and Transparency
Automation and Self-Healing
Scalability and Reliability
Standards and Governance
Cost Awareness
Feedback Loops and Continuous Improvement
Modularity and Extensibility
Documentation

Common Challenges and Bonus Points

Common Challenges in Platform Engineering
Bonus Points

Core Principles of Platform Engineering

Developer Experience (DX)

A platform exists to empower developers. The best platforms reduce friction, simplify workflows, and enable teams to deliver faster and more reliably. This means:

Self-service capabilities (e.g., provisioning, deployments).
Golden paths that provide standardized, pre-approved templates.
Clear and accessible documentation.
Intuitive UIs and APIs.

Happy developers are productive developers. Prioritize their experience.

Security and Compliance

Security should not be an afterthought. Platforms must:

Enforce least-privilege access and zero-trust principles (e.g., SSO, SCPs).
Automate policy checks (e.g., OPA/Gatekeeper, Kyverno).
Secure secrets management (e.g., HashiCorp Vault, AWS Secrets Manager).
Maintain audit trails and compliance logs (e.g., AWS CloudTrail).

By baking security into the platform itself, you minimize risks and simplify compliance.

Multi-Tenancy and Isolation

When multiple teams or products share a platform:

Isolate workloads with namespaces, network policies, resource quotas, and network segmentation (VPC/VNet).
Implement tenant-specific RBAC.
Ensure fair usage to prevent noisy neighbor issues.

Tenant isolation is critical for security and performance.

Observability and Transparency

A platform must be transparent in its operations:

Centralized logging, metrics, and tracing (e.g., Prometheus, Grafana, Loki, OpenTelemetry).
Dashboards for both platform engineers and end users.
Real-time alerts and root cause analysis.

Observability helps diagnose issues quickly, keeps everyone informed, and makes the platform transparent.

Automation and Self-Healing

Reduce manual toil by automating:

Infrastructure provisioning (e.g., Terraform).
CI/CD pipelines (e.g., GitHub Actions, ArgoCD).
Remediation of failures, scaling, and resource management.

Platforms should be self-healing and resilient by default.

Scalability and Reliability

A platform must:

Scale horizontally as demand grows.
Handle failures gracefully with retries, circuit breakers, and failovers.
Provide service level objectives (SLOs) and error budgets to manage reliability.
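
As one illustration of graceful failure handling, retries are usually paired with exponential backoff and jitter so that many clients do not retry in lockstep. The following is a minimal Python sketch; the function name, defaults, and the flaky operation are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
    raise AssertionError("max_attempts must be at least 1")


# Hypothetical dependency that fails twice before recovering.
state = {"calls": 0}

def flaky_dependency() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporarily unavailable")
    return "ok"

print(call_with_retries(flaky_dependency, sleep=lambda seconds: None))
```

The `sleep` parameter is injectable so the behavior can be tested without waiting.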

Reliable platforms build trust, and scalable platforms make room for growth.

Standards and Governance

Consistency accelerates delivery. Establish:

Golden paths with approved tech stacks and best practices.
Reuse of existing solutions and industry-standard tools wherever possible. Avoid reinventing the wheel; it’s often more efficient to adopt battle-tested patterns and tools than to build everything from scratch.
Code and configuration linting and validation.
Governance policies and automated enforcement.

By providing paved roads, teams can focus on innovation, not reinvention.

Cost Awareness

Platform costs can spiral out of control. Adopt:

Resource optimization and rightsizing (e.g., Karpenter).
Cost visibility dashboards.
Fair usage policies and showback/chargeback models.

Cost-efficient platforms are sustainable.

Feedback Loops and Continuous Improvement

Listen to your users and developers, and iterate:

Gather feedback via surveys, interviews, or support tickets.
Measure adoption, usage, and friction points.
Prioritize enhancements and bug fixes.

A great platform evolves with its users.

Modularity and Extensibility

Avoid monoliths. Instead:

Build modular, loosely coupled components.
Adopt an API-first approach: Design your platform’s capabilities as well-defined APIs that can be consumed by both internal and external systems. This promotes reusability, integration, and flexibility. APIs should be well-documented, versioned, and governed.
Allow for extensibility and plugin models.
Support gradual adoption and migration.

Modular platforms adapt better to changing needs.

Documentation

Documentation is as important as code:

Provide step-by-step guides, tutorials, and FAQs.
Keep documentation up to date and discoverable.
Offer workshops and internal community support.

Knowledge-sharing accelerates adoption.

Common Challenges in Platform Engineering

While the principles provide a solid foundation, platform engineering comes with its own set of challenges. Here are some common pitfalls and how to mitigate them:

Over-Engineering

It’s tempting to design a "perfect" platform with every feature imaginable. But this often leads to complexity and low adoption. Start small, deliver value early, and iterate based on user feedback.

Neglecting Developer Experience

A platform that’s hard to use will be ignored. Ensure a clear focus on usability, documentation, and support, and involve developers early in the design process.

Security as an Afterthought

Retrofitting security is expensive and risky. Integrate it from the start, automate checks, enforce least privilege, and audit all access.

Lack of Observability

Without good logging, monitoring, and tracing, troubleshooting becomes a nightmare. Prioritize observability as a first-class citizen of the platform.

Rigid Governance

While standards are essential, being too rigid stifles innovation. Provide "golden paths" but allow for "escape hatches" when teams need flexibility.

Ignoring Costs

Platforms can become expensive, especially at scale. Regularly review usage, optimize resource allocation, and implement cost transparency.

Underestimating Change Management

Introducing a platform often means changing how teams work. Invest in onboarding, training, and support to drive adoption and reduce resistance.

Bonus Points

A successful platform isn’t just about technology; it’s about empowering teams, reducing friction, ensuring security, and enabling innovation.

Also, don’t follow trends blindly; see what best fits your developers’ needs and aids effective software delivery. For example, if you are a small platform team (2-3 members) serving a handful of development teams, focus on solving real problems by understanding their needs rather than adopting a trending technology like Kubernetes without weighing its operational overhead, complexity, and steep learning curve.

By embracing these principles and being mindful of common challenges, platform engineers can build systems that not only scale technically but also foster a culture of collaboration, ownership, and excellence.

Whether you’re just starting your platform journey or scaling an existing one, these principles provide a solid foundation for sustainable success.

AWS LAMBDA BEST PRACTICES

Sunny Nazar — Fri, 31 Mar 2023 15:52:48 +0000

Overview
Best Practices
- Right language, Small functions and Trigger type
- Lambda Layers
- Optimize cold start times
- Environment Variables for Configuration
- Concurrency setting
- Use the right memory and CPU settings
- Secure your Lambda functions
- Dead Letter Queue (DLQ) and Retries for error handling
- Testing, Versioning and Aliases for Deployment
- Error Handling, Logging, Monitoring, Tracing
Documentation Links
Conclusion

Overview

AWS Lambda is a serverless computing platform that allows you to run your code in response to events and only pay for the compute time consumed. With Lambda, you can build and deploy applications without worrying about the underlying infrastructure. However, like any other technology, there are best practices that you can follow to ensure that you get the most out of it. In this blog, we'll look at some AWS Lambda best practices.

Best Practices

Right language, Small functions and Trigger type

AWS Lambda natively supports several programming languages, including Java, Go, PowerShell, Node.js, C#, Python, and Ruby. Lambda also provides a Runtime API that lets you use additional programming languages to create your functions. When choosing a language for your function, consider your use case and the language's strengths.

AWS Lambda is designed to run small, focused functions. When building your functions, try to keep them as small as possible and focused on a single task. This makes it easier to test, deploy, and maintain your code. Note that a single invocation can run for at most 15 minutes.

AWS Lambda supports different trigger types, such as API Gateway, S3, and CloudWatch Events (now Amazon EventBridge). Choose the right trigger type for your function based on your use case and expected workload. If you already follow an event-driven architecture, it should guide you toward the right trigger type.

Lambda Layers

If you have code that is shared across multiple functions, please consider using AWS Lambda Layers to manage it. A layer is a ZIP archive that contains libraries, custom runtimes, or other function code. You can use layers to manage dependencies, reduce the size of your function deployment packages, and simplify your code maintenance.

Optimize cold start times

Cold start times can impact the performance of your Lambda functions, especially for infrequently used functions. Optimize your code and use the right runtime to reduce cold start times. Some tips:

Reduce the size of your deployment package.
Use a language that has faster startup time.
Use provisioned concurrency.
Optimize resource allocation.
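
A related technique that is not in the list above but commonly recommended is to do expensive setup once at module scope, so only the cold start pays for it and warm invocations reuse the result. Below is a minimal Python sketch; `_expensive_setup` and `INIT_COUNT` are hypothetical stand-ins used only to make the behavior observable.

```python
import time

INIT_COUNT = 0


def _expensive_setup() -> dict:
    """Stand-in for slow work such as loading config or opening connections."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"ready_at": time.time()}


# Module scope runs once per execution environment (the cold start) ...
RESOURCES = _expensive_setup()


def handler(event, context):
    # ... while the handler body runs on every invocation and reuses RESOURCES.
    return {"init_count": INIT_COUNT, "echo": event}


# Two invocations in the same environment share one initialization.
print(handler({"n": 1}, None)["init_count"])
print(handler({"n": 2}, None)["init_count"])
```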

Environment Variables for Configuration

When building your functions, you may need to supply configuration such as endpoints, feature flags, or the names of secrets. Use environment variables to store configuration settings instead of hard-coding them in your function's code. This makes it easier to manage your configuration and update it as needed. For sensitive values such as API keys or database connection credentials, the best practice is to store them in SSM Parameter Store or AWS Secrets Manager and fetch them at runtime.
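
As a rough sketch of this pattern in Python, the function below reads settings from environment variables and fails fast when a required one is missing. All variable names are hypothetical, and the environment holds only the name of the secret, not its value.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Config:
    table_name: str
    timeout_seconds: float
    secret_name: str  # name of the secret, not the secret value itself


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Read settings from environment variables; a missing required key raises KeyError."""
    return Config(
        table_name=env["TABLE_NAME"],
        timeout_seconds=float(env.get("UPSTREAM_TIMEOUT_SECONDS", "3")),
        secret_name=env["DB_SECRET_NAME"],
    )


# Passing a plain dict makes the loader easy to exercise outside Lambda.
config = load_config({"TABLE_NAME": "orders", "DB_SECRET_NAME": "prod/orders-db"})
print(config)
```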

Concurrency setting

Configure your Lambda function with the right concurrency settings to handle incoming requests. The tips below will help you choose the right settings.

Understand your application's requirements: The first step in setting concurrency is to understand your application's requirements. Determine how many requests per second your application needs to handle, and set the concurrency limit accordingly.
Rely on automatic scaling: AWS Lambda automatically scales the number of concurrent executions with the number of incoming requests, up to the account and function concurrency limits. There is nothing to enable for this; make sure those limits are high enough that bursts of traffic are not throttled.
Reserve concurrency: The default quota for concurrent executions across all functions in an AWS account is 1,000 per Region. This means that by default, up to 1,000 requests can be processed simultaneously across all Lambda functions in that Region. Reserving concurrency guarantees that a certain number of executions are always available to a function, even when other functions are using up the shared pool; it also caps that function at the reserved value. This can be useful for functions that need to respond quickly to requests, such as real-time applications.
Monitor and adjust: It's important to monitor the concurrency usage of your functions and adjust the concurrency limit accordingly. If you're consistently hitting the concurrency limit, consider increasing it. Conversely, if you're consistently underutilizing provisioned concurrency, consider reducing it to save costs (reserved concurrency itself carries no extra charge).
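
For the first tip, a common starting estimate is that required concurrency equals the average requests per second multiplied by the average duration in seconds. Here is a small Python sketch of that calculation; the traffic numbers are made up for illustration.

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Concurrent executions needed = request rate x average duration, rounded up."""
    return math.ceil(requests_per_second * avg_duration_seconds)


# Hypothetical workload: 200 requests per second, 500 ms average duration.
print(required_concurrency(200, 0.5))  # 100
```

Treat the result as a baseline and leave headroom for bursts and slower-than-average invocations.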

Use the right memory and CPU settings

Configure your Lambda function with the right amount of memory to ensure optimal performance. Lambda allocates CPU in proportion to the configured memory, so raising the memory setting also gives the function more CPU. The right value depends on your workload, so be sure to test your functions under different load conditions and scenarios. Best practice is to start with the minimum memory the function requires and adjust the setting as you test.

Secure your Lambda functions

Use AWS Identity and Access Management (IAM) to restrict who can invoke and manage your Lambda functions, and use encryption to protect your data at rest and in transit. Also make sure the IAM execution role of each Lambda function follows the principle of least privilege, granting only the permissions the function actually needs.
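
A least-privilege execution role might look like the following standalone Terraform sketch. It assumes a function whose only AWS access is writing to its own, already existing CloudWatch log group; the role and policy names are placeholders.

```hcl
# Hypothetical input: the name of the function this role belongs to.
variable "function_name" {
  type        = string
  description = "Name of the Lambda function"
}

data "aws_region" "current" {}
data "aws_caller_identity" "current" {}

# Only the Lambda service may assume this role.
data "aws_iam_policy_document" "assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "lambda" {
  name               = "${var.function_name}-role"
  assume_role_policy = data.aws_iam_policy_document.assume.json
}

# Allow writing log events to this function's log group and nothing else.
data "aws_iam_policy_document" "logs" {
  statement {
    actions = ["logs:CreateLogStream", "logs:PutLogEvents"]
    resources = [
      "arn:aws:logs:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:log-group:/aws/lambda/${var.function_name}:*"
    ]
  }
}

resource "aws_iam_role_policy" "logs" {
  name   = "write-own-logs"
  role   = aws_iam_role.lambda.id
  policy = data.aws_iam_policy_document.logs.json
}
```

Further permissions can then be added one statement at a time, each scoped to the specific resources the function uses.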

Dead Letter Queue (DLQ) and Retries for error handling

Retries in Lambda refer to the number of times AWS Lambda automatically retries a function invocation after a function error. For asynchronous invocations, Lambda retries a failed invocation twice by default, waiting one minute before the first retry and two minutes before the second.
A DLQ is a queue (an SQS queue or an SNS topic) where AWS Lambda sends events that still fail after all retries, so that they can be analyzed or reprocessed later. Configure a DLQ for your Lambda function to handle errors more effectively and prevent data loss. This is particularly useful with asynchronous event sources such as SNS or S3.
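
A possible Terraform setup for both settings is sketched below, assuming an SQS queue as the DLQ. It is a standalone example; the queue name, function name, package path, and retention values are illustrative assumptions.

```hcl
# Hypothetical input: an existing execution role for the function.
variable "lambda_role_arn" {
  type        = string
  description = "ARN of the Lambda execution role"
}

# Queue that receives events which still fail after all retries.
resource "aws_sqs_queue" "lambda_dlq" {
  name                      = "example-function-dlq" # placeholder name
  message_retention_seconds = 1209600                # 14 days
}

resource "aws_lambda_function" "with_dlq" {
  function_name = "example-function"   # placeholder name
  role          = var.lambda_role_arn
  runtime       = "python3.9"
  handler       = "app.handler"        # assumed module and handler
  filename      = "build/function.zip" # assumed deployment package

  dead_letter_config {
    target_arn = aws_sqs_queue.lambda_dlq.arn
  }
}

# Retry behaviour for asynchronous invocations.
resource "aws_lambda_function_event_invoke_config" "retries" {
  function_name                = aws_lambda_function.with_dlq.function_name
  maximum_retry_attempts       = 2    # allowed range is 0 to 2
  maximum_event_age_in_seconds = 3600 # discard events older than one hour
}
```

For this to work, the execution role must be allowed to call `sqs:SendMessage` on the queue.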

Testing, Versioning and Aliases for Deployment

Test your Lambda functions thoroughly before deploying them to production. Use a combination of unit tests, integration tests, and end-to-end tests to ensure that your functions are working as expected.

When deploying your functions, use versioning and aliases to manage your code. Versioning allows you to create and manage multiple versions of your function code, while aliases provide a consistent name for your function's entry point. This makes it easier to manage deployments and rollbacks.
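
As a small standalone sketch of the alias idea in Terraform, assuming the function and its published versions already exist (the alias name `live` and both variables are hypothetical):

```hcl
# Hypothetical inputs: an existing function and one of its published versions.
variable "function_name" {
  type        = string
  description = "Name of the Lambda function"
}

variable "live_version" {
  type        = string
  description = "Published version number that should serve traffic"
}

# Callers invoke the alias, so a release or rollback only changes
# which version the alias points to.
resource "aws_lambda_alias" "live" {
  name             = "live"
  description      = "Version currently serving production traffic"
  function_name    = var.function_name
  function_version = var.live_version
}
```

Rolling back then amounts to setting `live_version` to the previous version number and applying again.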

Use a deployment pipeline to automate the process of building, testing, and deploying your Lambda functions. This can help you release new features and updates more frequently and with less risk.

Error Handling, Logging, Monitoring, Tracing

Use AWS Lambda Powertools to simplify your code: AWS Lambda Powertools is a set of open-source utilities and libraries that simplify your code and improve observability. It includes modules for logging, error handling, metrics, and tracing, and reduces the amount of boilerplate code you need to write.
Monitor your functions for performance and errors: AWS Lambda integrates with CloudWatch Metrics, which lets you monitor invocations, errors, duration, and throttles. Make sure you track the metrics that matter for your workload and set up alarms to notify you of any issues.
Use AWS X-Ray for tracing: Use AWS X-Ray to trace requests through your Lambda function and other AWS services. This can help you identify performance bottlenecks and troubleshoot issues more easily.
Use logging for debugging: When developing your functions, use logging to help you debug issues. AWS Lambda integrates with CloudWatch Logs, which lets you view and analyze your logs in near real time. Make sure that your logging is comprehensive and includes useful information, such as error messages and input parameters.
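
An alarm on function errors could be defined in Terraform along these lines. This is a standalone sketch: the threshold, period, and the SNS topic used for notifications are assumptions to adapt to your own workload.

```hcl
# Hypothetical inputs: the function to watch and a topic to notify.
variable "function_name" {
  type        = string
  description = "Name of the Lambda function"
}

variable "alarm_topic_arn" {
  type        = string
  description = "ARN of an SNS topic that receives alarm notifications"
}

# Fire when the function reports any error within a 5-minute window.
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "${var.function_name}-errors"
  alarm_description   = "Lambda function reported errors"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  treat_missing_data  = "notBreaching" # no invocations means no errors

  dimensions = {
    FunctionName = var.function_name
  }

  alarm_actions = [var.alarm_topic_arn]
}
```

The same shape works for the `Throttles` and `Duration` metrics by changing `metric_name`, `statistic`, and `threshold`.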

Documentation Links

Conclusion

AWS Lambda is a powerful tool for building and deploying serverless applications. By following these best practices, you can build robust and reliable serverless applications whose functions are scalable, secure, and easy to manage.

Securely Access Your EC2 Instances with AWS Systems Manager SSM and VPC Endpoints

Sunny Nazar — Wed, 29 Mar 2023 15:00:19 +0000

Overview
Background Knowledge
- What is SSH-Less Login?
- What is AWS Systems Manager (SSM)?
- How to Use SSM for SSH-Less Login?
Terraform code
Documentation Links
Conclusion

Overview

As more organizations adopt cloud computing, managing resources on cloud platforms like Amazon Web Services (AWS) becomes increasingly important. The need to manage many Amazon Elastic Compute Cloud (EC2) instances effectively has led to the development of various tools that simplify the process. One such tool is AWS Systems Manager (SSM), which lets users manage EC2 instances, as well as other AWS resources, through a single interface. One of its most powerful features is SSH-less login to EC2 machines, which we will explore in this post.

Background Knowledge

What is SSH-Less Login?

Traditionally, logging into an EC2 instance involves connecting via SSH with a username and password or a key pair. However, managing SSH keys can be challenging, particularly when dealing with multiple EC2 instances. SSH-less login, on the other hand, is a more secure and efficient method of accessing EC2 instances that requires neither SSH keys nor an open inbound SSH port.

What is AWS Systems Manager (SSM)?

AWS Systems Manager (SSM) is a management service that enables users to automate the management of their EC2 instances and other AWS resources. SSM enables users to perform various tasks, including software installation, patching, and maintenance across a fleet of EC2 instances. It also provides a single interface to manage EC2 instances running in different regions and accounts.

How to Use SSM for SSH-Less Login?

To use SSM for SSH-less login, follow the steps below:

Security Group for EC2 Instance: At minimum, the security group of the EC2 instance must allow outbound HTTPS (port 443) traffic for SSM access to work. No inbound rules are required.

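Create an IAM Role: To use SSM to log in to EC2 instances, you must first create an IAM role with the required permissions and attach it to the instances through an instance profile. The role must have the AmazonSSMManagedInstanceCore managed policy attached, which allows the SSM Agent on the instance to communicate with Systems Manager. This policy replaces the older, broader AmazonEC2RoleforSSM policy.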

Install SSM Agent: After creating the IAM role, make sure the SSM Agent is running on each EC2 instance you want to access using SSM. The agent is preinstalled on Amazon Linux and Amazon Linux 2 AMIs, as well as on some other AWS-provided AMIs; on images that do not include it, you must install it manually.

Configure EC2 Instances: Once the SSM Agent is installed, the instance needs a network path to the Systems Manager endpoints. For instances in private subnets, you can provide it by creating the following interface VPC endpoints:

com.amazonaws.region.ec2messages
com.amazonaws.region.ssmmessages
com.amazonaws.region.ssm
com.amazonaws.region.kms (This is needed if you want to use AWS KMS encryption for Session Manager.)

Note: The security group for the VPC endpoints must allow inbound HTTPS (port 443) traffic from the resources in your VPC that communicate with the service.

Terraform code

Let's start by creating the VPC, public subnet, private subnet, internet gateway, NAT gateway, and route tables.

Prerequisite - Create provider configuration.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.60.0"
    }
  }
}

provider "aws" {
  region = var.region
}

The region variable and the networking resources can be defined like this:

# Please set variable region as per your needs.
variable "region" {
  type        = string
  description = "Region for the resource deployment"
  default     = "eu-central-1"
}

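# Create a VPC
resource "aws_vpc" "vpc" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true # needed for private DNS on interface endpoints
  enable_dns_hostnames = true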
  tags = {
    Name = "vpc-${var.region}"
  }
}

# Create an internet gateway
resource "aws_internet_gateway" "gw" {
  vpc_id = aws_vpc.vpc.id
  tags = {
    Name = "igw-${var.region}"
  }
}

# Create a public subnet
resource "aws_subnet" "public_subnet" {
  vpc_id            = aws_vpc.vpc.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "${var.region}a"
  tags = {
    Name = "Public Subnet"
  }
}

# Create a private subnet
resource "aws_subnet" "private_subnet" {
  vpc_id            = aws_vpc.vpc.id
  cidr_block        = "10.0.2.0/24"
  availability_zone = "${var.region}a"
  tags = {
    Name = "Private Subnet"
  }
}

# Create a NAT gateway
resource "aws_nat_gateway" "nat_gateway" {
  allocation_id = aws_eip.nat_eip.id
  subnet_id     = aws_subnet.public_subnet.id
  tags = {
    Name = "ngw-${var.region}"
  }
}

# Create an EIP for the NAT gateway
resource "aws_eip" "nat_eip" {
  vpc = true
}

# Create a public route table and associate it with the public subnet
resource "aws_route_table" "public_route_table" {
  vpc_id = aws_vpc.vpc.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.gw.id
  }
  tags = {
    Name = "Public route table"
  }
}

resource "aws_route_table_association" "public_route_table_association" {
  subnet_id      = aws_subnet.public_subnet.id
  route_table_id = aws_route_table.public_route_table.id
}

# Create a private route table and associate it with the private subnet
resource "aws_route_table" "private_route_table" {
  vpc_id = aws_vpc.vpc.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat_gateway.id
  }
  tags = {
    Name = "Private route table"
  }
}

resource "aws_route_table_association" "private_route_table_association" {
  subnet_id      = aws_subnet.private_subnet.id
  route_table_id = aws_route_table.private_route_table.id
}

Let's now create the security groups for the EC2 instance and the VPC endpoints.

# Create a security group for the EC2 instance
resource "aws_security_group" "instance_security_group" {
  name_prefix = "instance-sg"
  vpc_id      = aws_vpc.vpc.id
  description = "security group for the EC2 instance"

  # Allow outbound HTTPS traffic
  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "Allow HTTPS outbound traffic"
  }

  tags = {
    Name = "EC2 Instance security group"
  }
}

# Security group for VPC Endpoints
resource "aws_security_group" "vpc_endpoint_security_group" {
  name_prefix = "vpc-endpoint-sg"
  vpc_id      = aws_vpc.vpc.id
  description = "security group for VPC Endpoints"

  # Allow inbound HTTPS traffic
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.vpc.cidr_block]
    description = "Allow HTTPS traffic from VPC"
  }

  tags = {
    Name = "VPC Endpoint security group"
  }
}

Now we can create VPC Endpoints

locals {
  endpoints = {
    "endpoint-ssm" = {
      name = "ssm"
    },
    "endpoint-ssm-messages" = {
      name = "ssmmessages"
    },
    "endpoint-ec2-messages" = {
      name = "ec2messages"
    }
  }
}

resource "aws_vpc_endpoint" "endpoints" {
  for_each          = local.endpoints
  vpc_id            = aws_vpc.vpc.id
  vpc_endpoint_type = "Interface"
  service_name      = "com.amazonaws.${var.region}.${each.value.name}"
  # Place the endpoint network interfaces in the private subnet
  subnet_ids = [aws_subnet.private_subnet.id]
  # Add a security group to the VPC endpoint
  security_group_ids = [aws_security_group.vpc_endpoint_security_group.id]
}

After creating endpoints, the final components are Instance profile and EC2 instance.

# Create IAM role for EC2 instance
resource "aws_iam_role" "ec2_role" {
  name = "EC2_SSM_Role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
        Action = "sts:AssumeRole"
      }
    ]
  })
}

# Attach AmazonSSMManagedInstanceCore policy to the IAM role
resource "aws_iam_role_policy_attachment" "ec2_role_policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  role       = aws_iam_role.ec2_role.name
}

# Create an instance profile for the EC2 instance and associate the IAM role
resource "aws_iam_instance_profile" "ec2_instance_profile" {
  name = "EC2_SSM_Instance_Profile"

  # An instance profile holds exactly one role
  role = aws_iam_role.ec2_role.name
}

# Look up the latest Amazon Linux 2 AMI (SSM Agent is preinstalled)
data "aws_ami" "amazon_linux_2_ssm" {
  most_recent = true
  # Only consider images published by Amazon
  owners = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-ebs"]
  }
}

# Create EC2 instance
resource "aws_instance" "ec2_instance" {
  ami           = data.aws_ami.amazon_linux_2_ssm.id
  instance_type = "t2.micro"
  subnet_id     = aws_subnet.private_subnet.id
  vpc_security_group_ids = [
    aws_security_group.instance_security_group.id,
  ]
  iam_instance_profile = aws_iam_instance_profile.ec2_instance_profile.name
}

Access EC2 Instance using SSM: After completing the above steps, you can access your EC2 instance using SSM without requiring an SSH key. To do this, navigate to the EC2 console and select the instance you want to access. Then click the "Connect" button and choose the "Session Manager" tab. This opens a web-based shell that lets you interact with the instance.
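
Sessions can also be started from a terminal. The small addition below is a sketch that builds on the configuration above and assumes the AWS CLI and the Session Manager plugin are installed locally; the output name `instance_id` is a hypothetical choice.

```hcl
# Expose the instance ID so it can be passed to Session Manager.
output "instance_id" {
  description = "ID of the private EC2 instance"
  value       = aws_instance.ec2_instance.id
}
```

After `terraform apply`, a session could then be opened with `aws ssm start-session --target "$(terraform output -raw instance_id)"`.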

Documentation Links

Conclusion

Using SSM for SSH-less login provides a secure and efficient way to manage multiple EC2 instances without the need for managing SSH keys. SSM makes it easy to perform tasks like software installation, patching, and maintenance across a fleet of EC2 instances using a single interface. With the steps outlined above, you can easily set up SSH-less login for your EC2 instances and enjoy the benefits of streamlined instance management.
