DEV Community

Taverne Tech

Metrics + Logs = Zero Panic in Production 🚨

Feel free to check out all the articles on my blog, Taverne Tech!

Introduction

Did you know that in 2023, one hour of downtime at a large tech company can cost up to $5 million? 💸 And here’s the chilling fact: 80% of outages are detected by end users before technical teams notice!

Today, we’ll discover why metrics and logs aren’t just nice-to-haves, but your superpowers to dominate production like a true DevOps ninja! 🥷


1. Metrics: Your Magic Dashboard ✨

Metrics are like your car’s dashboard: without them, you don’t know your speed or whether you’re about to run out of fuel! In DevOps, they give you a real-time, high-level view of your system’s health.

The 4 Golden Signals 🏆

  1. Latency: How long it takes to serve a request
  2. Traffic: Number of requests per second
  3. Errors: Request failure rate
  4. Saturation: Resource utilization

Here’s a concrete example using Prometheus and Go:

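package main

import (
    "log"
    "net/http"
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Latency: request duration, labeled by method and endpoint.
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "Duration of HTTP requests in seconds",
        },
        []string{"method", "endpoint"},
    )

    // Traffic and errors: request count, labeled by method, endpoint and status.
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
)

func init() {
    prometheus.MustRegister(requestDuration, requestsTotal)
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (rec *statusRecorder) WriteHeader(code int) {
    rec.status = code
    rec.ResponseWriter.WriteHeader(code)
}

func metricsMiddleware(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // r.URL.Path is fine for a fixed set of routes; paths containing IDs
        // should be replaced by a route template to keep label cardinality low.
        timer := prometheus.NewTimer(requestDuration.WithLabelValues(r.Method, r.URL.Path))
        defer timer.ObserveDuration()

        // Default to 200: that is what net/http sends if the handler never calls WriteHeader.
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next(rec, r)
        requestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
    }
}

func main() {
    // The /hello route and port 8080 are placeholders for illustration.
    http.HandleFunc("/hello", metricsMiddleware(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("hello"))
    }))
    http.Handle("/metrics", promhttp.Handler()) // endpoint scraped by Prometheus
    log.Fatal(http.ListenAndServe(":8080", nil))
}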

Surprising fact: Google uses more than 10 million different metrics to monitor its services! 🤯 Don’t worry though — you can start with just 5–10 key metrics.


2. Logs: The Diary of Your Applications 📖

Logs are the personal diary of your application! They tell the full story: who did what, when, and sometimes why everything blew up.

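Historical anecdote: In 1947, operators of the Harvard Mark II found a real moth stuck in a relay and taped it into the logbook with the note “First actual case of bug being found.” Engineers had been calling defects “bugs” long before that, but Grace Hopper loved telling the story, and that page may be the most famous “log entry” ever: literally an insect taped into a notebook! 🦋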

Log Levels: A Well-Designed Hierarchy

package main

import (
    "errors"
    "os"

    "github.com/rs/zerolog"
    "github.com/rs/zerolog/log"
)

func init() {
    // Structured logger configuration
    zerolog.TimeFieldFormat = zerolog.TimeFormatUnix
    log.Logger = log.Output(zerolog.ConsoleWriter{Out: os.Stderr})
}

func processOrder(orderID string, customerID string) error {
    log.Info().
        Str("order_id", orderID).
        Str("customer_id", customerID).
        Msg("Processing new order")

    // Simulated error
    if orderID == "invalid" {
        err := errors.New("invalid order format")
        // Err attaches the error under the standard "error" field.
        log.Error().
            Err(err).
            Str("order_id", orderID).
            Msg("Failed to process order")
        return err
    }

    // The processing time is hard-coded for the example.
    log.Info().
        Str("order_id", orderID).
        Float64("processing_time_ms", 125.5).
        Msg("Order processed successfully")

    return nil
}

func main() {
    // Sample IDs for illustration only.
    _ = processOrder("order-1001", "customer-42")
    _ = processOrder("invalid", "customer-42")
}

Mind-blowing statistic: By one widely cited estimate, the world creates about 2.5 quintillion bytes of data every single day, and machine-generated data such as logs keeps adding to the pile! 😱


3. The Art of Correlation: When Sherlock Holmes Meets DevOps 🕵️

Having metrics and logs separately is like having all the clues of an investigation, but stored in different boxes! The real power comes from correlating the data.

Distributed Tracing: Following the Trail 🔍

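package main

import (
    "context"
    "errors"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("ecommerce-service")

func processPayment(ctx context.Context, amount float64, userID string) error {
    // Create a span to trace the operation
    ctx, span := tracer.Start(ctx, "process-payment")
    defer span.End()

    // Add attributes to simplify debugging
    span.SetAttributes(
        attribute.Float64("payment.amount", amount),
        attribute.String("user.id", userID),
    )

    // Simulated service calls
    if err := validatePayment(ctx, amount); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error()) // mark the span itself as failed
        return err
    }

    if err := chargeCard(ctx, amount, userID); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }

    span.SetAttributes(attribute.String("payment.status", "success"))
    return nil
}

// validatePayment and chargeCard are stubs standing in for real service calls.
func validatePayment(ctx context.Context, amount float64) error {
    if amount <= 0 {
        return errors.New("amount must be positive")
    }
    return nil
}

func chargeCard(ctx context.Context, amount float64, userID string) error {
    return nil
}

func main() {
    // Without a configured TracerProvider, otel falls back to a no-op tracer,
    // so this runs but exports nothing. The amount and user ID are sample values.
    if err := processPayment(context.Background(), 49.90, "user-123"); err != nil {
        log.Fatal(err)
    }
}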

Staggering revelation: Teams using full observability (metrics + logs + traces) resolve incidents 3× faster and reduce mean time to detection by 95%! 🚀

The Golden Rule of Alerts

# Prometheus configuration for smart alerts
groups:
- name: ecommerce-alerts
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value | humanizePercentage }}"

  - alert: ResponseTimeHigh
    expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Response time is too high"

Conclusion

Metrics and logs aren’t just technical tools — they are your eyes and ears in production! Without them, you’re like a pilot flying through fog, hoping everything turns out fine. 🤞

Let’s recap the key points:

  • Metrics give you the big picture in real time
  • Logs tell the full story of what happened
  • Correlating both turns you into a production detective

Investing in good monitoring may seem expensive, but remember: one minute of downtime costs an average of $9,000! Would your boss rather invest in tools — or explain to customers why the site is down? 😅

So, are you ready to switch from “firefighter mode” to “preventive mode”? Start small: implement a few core metrics, add structured logging, and watch the magic happen! ✨

Your future self (and your team) will thank you when you resolve the next incident in 5 minutes instead of 5 hours! 🎯

