Don't hesitate to check out all the articles on my blog, Taverne Tech!
Introduction
Did you know that one hour of downtime at a large tech company can reportedly cost up to $5 million? 💸 And here’s the chilling part: by some estimates, 80% of outages are noticed by end users before the technical team spots them!
Today, we’ll discover why metrics and logs aren’t just nice-to-haves, but your superpowers to dominate production like a true DevOps ninja! 🥷
1. Metrics: Your Magic Dashboard ✨
Metrics are like your car’s dashboard: without them, you don’t know your speed or whether you’re about to run out of fuel! In DevOps, they give you a real-time, high-level view of your system’s health.
The 4 Golden Signals 🏆
- Latency: How long it takes to serve a request
- Traffic: Number of requests per second
- Errors: Request failure rate
- Saturation: Resource utilization
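To make these signals less abstract, here is a minimal sketch in plain Go (standard library only) that derives traffic, error rate, and a 95th-percentile latency from a batch of request samples. The `Sample` and `Signals` types and the `Summarize` function are hypothetical names invented for this illustration, and counting only 5xx responses as errors is an assumption. Saturation is left out because it usually comes from resource gauges (CPU, memory, queue depth) rather than from request records.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Sample is a hypothetical record of one handled request.
type Sample struct {
	Duration time.Duration
	Status   int
}

// Signals holds three of the four golden signals for one time window.
type Signals struct {
	TrafficRPS float64       // requests per second
	ErrorRate  float64       // fraction of requests that ended in a 5xx status
	P95Latency time.Duration // 95th percentile latency (nearest-rank method)
}

// Summarize computes the signals for samples collected during window.
func Summarize(samples []Sample, window time.Duration) Signals {
	if len(samples) == 0 || window <= 0 {
		return Signals{}
	}
	durations := make([]time.Duration, 0, len(samples))
	failed := 0
	for _, s := range samples {
		durations = append(durations, s.Duration)
		if s.Status >= 500 {
			failed++
		}
	}
	sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })

	// Nearest-rank percentile: rank = ceil(n * 0.95), using integer math.
	rank := (len(durations)*95 + 99) / 100
	return Signals{
		TrafficRPS: float64(len(samples)) / window.Seconds(),
		ErrorRate:  float64(failed) / float64(len(samples)),
		P95Latency: durations[rank-1],
	}
}

func main() {
	samples := []Sample{
		{20 * time.Millisecond, 200},
		{35 * time.Millisecond, 200},
		{50 * time.Millisecond, 500},
		{900 * time.Millisecond, 200},
	}
	s := Summarize(samples, 2*time.Second)
	fmt.Printf("traffic=%.1f rps, errors=%.0f%%, p95=%v\n",
		s.TrafficRPS, s.ErrorRate*100, s.P95Latency)
	// prints: traffic=2.0 rps, errors=25%, p95=900ms
}
```

Notice how a single slow request dominates the p95 here while the average would hide it. That is one reason latency is usually tracked as a percentile rather than a mean.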
Here’s a concrete example using Prometheus and Go:
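package main

import (
	"log"
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Latency signal: request duration, using the client's default buckets.
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "http_request_duration_seconds",
			Help: "Duration of HTTP requests in seconds",
		},
		[]string{"method", "endpoint"},
	)
	// Traffic and error signals: request count, labeled by status code.
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)
)

func init() {
	prometheus.MustRegister(requestDuration, requestsTotal)
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// r.URL.Path is fine as a label for a fixed set of routes; paths that embed IDs create one series per ID.
		timer := prometheus.NewTimer(requestDuration.WithLabelValues(r.Method, r.URL.Path))
		defer timer.ObserveDuration()

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next(rec, r)
		// Record the real status instead of a hard-coded "200", so errors show up in the counter.
		requestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
	}
}

func main() {
	// The /hello route and port 8080 are illustrative choices.
	http.HandleFunc("/hello", metricsMiddleware(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	}))
	http.Handle("/metrics", promhttp.Handler()) // endpoint scraped by Prometheus
	log.Fatal(http.ListenAndServe(":8080", nil))
}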
Surprising fact: Google reportedly tracks more than 10 million different metrics to monitor its services! 🤯 Don’t worry though: you can start with just 5–10 key metrics.
2. Logs: The Diary of Your Applications 📖
Logs are the personal diary of your application! They tell the full story: who did what, when, and sometimes why everything blew up.
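Historical anecdote: In 1947, operators of the Harvard Mark II computer found a real moth stuck in a relay and taped it into the logbook with the note “First actual case of bug being found.” Grace Hopper loved telling the story, which helped popularize the term, even though engineers had already been calling glitches “bugs” for decades. One of the most famous “log entries” in history was literally… an insect taped into a notebook! 🦋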
Log Levels: A Well-Designed Hierarchy
package main

import (
	"errors"
	"os"

	"github.com/rs/zerolog"
	"github.com/rs/zerolog/log"
)

func init() {
	// Structured logger configuration: Unix timestamps, human-readable console output
	zerolog.TimeFieldFormat = zerolog.TimeFormatUnix
	log.Logger = log.Output(zerolog.ConsoleWriter{Out: os.Stderr})
}

func processOrder(orderID string, customerID string) error {
	// Info level: normal business events
	log.Info().
		Str("order_id", orderID).
		Str("customer_id", customerID).
		Msg("Processing new order")

	// Simulated error
	if orderID == "invalid" {
		// Error level: the operation failed and someone may need to look at it
		log.Error().
			Str("order_id", orderID).
			Str("error", "invalid_order_format").
			Msg("Failed to process order")
		return errors.New("invalid order")
	}

	log.Info().
		Str("order_id", orderID).
		Float64("processing_time_ms", 125.5).
		Msg("Order processed successfully")
	return nil
}

func main() {
	// Illustrative IDs: one successful order and one that triggers the error path
	_ = processOrder("A-1001", "C-42")
	_ = processOrder("invalid", "C-42")
}
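The "hierarchy" part of log levels is easiest to see when you filter on it. The sketch below swaps zerolog for Go's standard `log/slog` package (Go 1.21 or newer) purely to stay dependency-free; it is not the setup used above. The `newLogger` and `emittedLevels` helpers and the sample messages are made up for the illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"strings"
)

// newLogger returns a JSON logger that drops every record below minLevel.
func newLogger(buf *bytes.Buffer, minLevel slog.Level) *slog.Logger {
	return slog.New(slog.NewJSONHandler(buf, &slog.HandlerOptions{Level: minLevel}))
}

// emittedLevels logs one record per standard level and reports which
// levels actually made it into the output.
func emittedLevels(minLevel slog.Level) []string {
	var buf bytes.Buffer
	logger := newLogger(&buf, minLevel)

	logger.Debug("cache lookup", "key", "order:42")
	logger.Info("order received", "order_id", "42")
	logger.Warn("payment retry", "attempt", 2)
	logger.Error("payment failed", "order_id", "42")

	var kept []string
	for _, level := range []string{"DEBUG", "INFO", "WARN", "ERROR"} {
		if strings.Contains(buf.String(), `"level":"`+level+`"`) {
			kept = append(kept, level)
		}
	}
	return kept
}

func main() {
	fmt.Println(emittedLevels(slog.LevelDebug)) // [DEBUG INFO WARN ERROR]
	fmt.Println(emittedLevels(slog.LevelWarn))  // [WARN ERROR]
}
```

A common pattern is to run with a Debug threshold in development and Info or Warn in production, so the same code produces far less noise where volume costs money.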
Mind-blowing statistic: By a widely quoted estimate, the world creates about 2.5 quintillion bytes of data every single day, and your logs are part of that flood! 😱 That is exactly why structured, filterable logs matter.
3. The Art of Correlation: When Sherlock Holmes Meets DevOps 🕵️
Having metrics and logs separately is like having all the clues of an investigation, but stored in different boxes! The real power comes from correlating the data.
Distributed Tracing: Following the Trail 🔍
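package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// Without a configured TracerProvider this tracer is a no-op, so the code still runs.
var tracer = otel.Tracer("ecommerce-service")

func processPayment(ctx context.Context, amount float64, userID string) error {
	// Create a span to trace the operation
	ctx, span := tracer.Start(ctx, "process-payment")
	defer span.End()

	// Add attributes to simplify debugging
	span.SetAttributes(
		attribute.Float64("payment.amount", amount),
		attribute.String("user.id", userID),
	)

	// Simulated service calls; passing ctx lets them attach child spans to this one
	if err := validatePayment(ctx, amount); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error()) // mark the span as failed
		return err
	}
	if err := chargeCard(ctx, amount, userID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	span.SetAttributes(attribute.String("payment.status", "success"))
	return nil
}

// Placeholder implementations so the example compiles; real ones would call other services.
func validatePayment(ctx context.Context, amount float64) error {
	if amount <= 0 {
		return errors.New("amount must be positive")
	}
	return nil
}

func chargeCard(ctx context.Context, amount float64, userID string) error {
	return nil
}

func main() {
	_ = processPayment(context.Background(), 49.99, "user-123") // illustrative values
}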
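Correlation often boils down to stamping the same identifier on everything a request touches. Here is a minimal standard-library sketch of that idea: a request ID travels through `context.Context` and ends up in every log line. All names in it (`WithRequestID`, `RequestID`, `NewRequestID`, `handlePayment`, and the log format) are hypothetical; with OpenTelemetry you would typically log the span's trace ID instead.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// ctxKey is an unexported key type, so other packages cannot collide with it.
type ctxKey struct{}

// WithRequestID returns a context that carries the given correlation ID.
func WithRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, ctxKey{}, id)
}

// RequestID returns the ID stored in ctx, or "unknown" if there is none.
func RequestID(ctx context.Context) string {
	if id, ok := ctx.Value(ctxKey{}).(string); ok {
		return id
	}
	return "unknown"
}

// NewRequestID returns 16 random hexadecimal characters.
func NewRequestID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// logLine formats a log entry stamped with the request ID from ctx.
func logLine(ctx context.Context, msg string) string {
	return fmt.Sprintf("request_id=%s msg=%q", RequestID(ctx), msg)
}

func validate(ctx context.Context) string { return logLine(ctx, "payment validated") }
func charge(ctx context.Context) string   { return logLine(ctx, "card charged") }

// handlePayment runs both steps under one request ID and returns their log lines.
func handlePayment(requestID string) []string {
	ctx := WithRequestID(context.Background(), requestID)
	return []string{validate(ctx), charge(ctx)}
}

func main() {
	for _, line := range handlePayment(NewRequestID()) {
		fmt.Println(line) // both lines share the same request_id
	}
}
```

With the ID in place, a single search for `request_id=<value>` pulls up the whole story of one request, even when the steps ran in different functions or services.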
Staggering revelation: Teams using full observability (metrics + logs + traces) resolve incidents 3× faster and reduce mean time to detection by 95%! 🚀
The Golden Rule of Alerts
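# Prometheus alerting rules (a rules file referenced from rule_files in prometheus.yml)
groups:
  - name: ecommerce-alerts
    rules:
      - alert: HighErrorRate
        # Share of all requests that returned a 5xx status over the last 5 minutes.
        # Both sides are summed first: dividing the raw series would match each
        # 5xx series against itself and always yield 1.
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: ResponseTimeHigh
        # 95th percentile latency, evaluated per method and endpoint
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Response time is too high"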
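The `for:` clause in these rules is what keeps a brief spike from paging anyone: the condition has to stay true for the whole duration before the alert fires. The following Go sketch is a simplified model of that behavior, not Prometheus's actual implementation. The `Evaluator` type, the `Simulate` helper, and the one-minute evaluation interval are assumptions made for the illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Evaluator is a simplified model of one alert rule with a "for" duration.
type Evaluator struct {
	Threshold float64
	For       time.Duration

	active      bool      // true while the condition currently holds
	activeSince time.Time // when the condition first became true
}

// Evaluate feeds one measurement and returns "inactive", "pending", or "firing".
func (e *Evaluator) Evaluate(now time.Time, value float64) string {
	if value <= e.Threshold {
		e.active = false // condition cleared: the timer resets
		return "inactive"
	}
	if !e.active {
		e.active = true
		e.activeSince = now
	}
	if now.Sub(e.activeSince) >= e.For {
		return "firing"
	}
	return "pending"
}

// Simulate evaluates the values at a fixed interval and returns each state.
func Simulate(values []float64, stepSeconds, forSeconds int, threshold float64) []string {
	e := &Evaluator{Threshold: threshold, For: time.Duration(forSeconds) * time.Second}
	start := time.Unix(0, 0)
	states := make([]string, 0, len(values))
	for i, v := range values {
		now := start.Add(time.Duration(i*stepSeconds) * time.Second)
		states = append(states, e.Evaluate(now, v))
	}
	return states
}

func main() {
	// Error ratios sampled once a minute against a 5% threshold held for 2 minutes.
	rates := []float64{0.01, 0.08, 0.09, 0.07, 0.02}
	fmt.Println(Simulate(rates, 60, 120, 0.05))
	// prints: [inactive pending pending firing inactive]
}
```

A shorter `for` catches problems sooner but pages on more blips, while a longer one trades detection time for quieter nights. The 2m and 5m values above are a starting point to tune, not a rule.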
Conclusion
Metrics and logs aren’t just technical tools — they are your eyes and ears in production! Without them, you’re like a pilot flying through fog, hoping everything turns out fine. 🤞
Let’s recap the key points:
- Metrics give you the big picture in real time
- Logs tell the full story of what happened
- Correlating both turns you into a production detective
Investing in good monitoring may seem expensive, but remember: one minute of downtime costs an average of $9,000! Would your boss rather invest in tools — or explain to customers why the site is down? 😅
So, are you ready to switch from “firefighter mode” to “preventive mode”? Start small: implement a few core metrics, add structured logging, and watch the magic happen! ✨
Your future self (and your team) will thank you when you resolve the next incident in 5 minutes instead of 5 hours! 🎯
