CESAR NIKOLAS CAMAC MELENDEZ

πŸ” Observability Practices: A Practical Guide With Real-World Examples

Modern software systems are more distributed, dynamic, and complex than ever. Microservices, serverless functions, containers, and event-driven architectures make traditional monitoring insufficient.
To understand what is actually happening inside your application, you need observability.

This article explains the foundational pillars of observability, best practices, common tools, and includes a hands-on real-world example using Grafana + Prometheus for metrics and ELK Stack for logs.


🚦 What Is Observability?

Observability is the ability to understand the internal state of a system based solely on the data it produces, such as metrics, logs, and traces.

Unlike classical monitoring, which answers β€œIs the system up?”, observability answers:

  • Why is the system slow?
  • Where is the latency coming from?
  • Which dependency failed?
  • What changed recently that caused errors?

πŸ“Š The Three Pillars of Observability

1. Metrics

Numeric measurements collected over time.
Examples:

  • CPU usage
  • Request throughput
  • Error rate
  • Database latency

Tools: Prometheus, Grafana, Datadog Metrics, AWS CloudWatch Metrics.


2. Logs

Immutable records of events happening in the system.
Examples:

  • Info logs
  • Warnings
  • Errors
  • Audit logs

Tools: ELK Stack, Datadog Logs, Azure Log Analytics, Loki.


3. Traces

A trace follows a single request across distributed components.
Examples:

  • Microservices call chain
  • Latency across services
  • Errors from downstream dependencies

Tools: Jaeger, Zipkin, OpenTelemetry, AWS X-Ray, New Relic.


🧱 Core Observability Practices

βœ”οΈ 1. Use Structured Logging

Instead of plain text logs, use JSON.

{
  "timestamp": "2025-11-23T10:15:20Z",
  "level": "ERROR",
  "service": "orders-api",
  "message": "Payment gateway timeout",
  "orderId": 10422,
  "durationMs": 3200
}

Structured logs allow better filtering and search on platforms like ELK, Datadog, and CloudWatch.
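A structured log line is just JSON written to stdout (or a transport). Here is a minimal sketch using only Node's standard library; the field names (`service`, `orderId`, etc.) mirror the example above and are illustrative, not a standard:

```javascript
// Minimal structured logger: emits one JSON object per line.
function logEvent(level, service, message, extra = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    service,
    message,
    ...extra // merge any context fields (orderId, durationMs, ...)
  };
  console.log(JSON.stringify(entry));
  return entry; // returned so callers/tests can inspect it
}

logEvent("ERROR", "orders-api", "Payment gateway timeout", {
  orderId: 10422,
  durationMs: 3200
});
```

In practice you would use a library like winston or pino, but the principle is the same: one machine-parseable object per event.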


βœ”οΈ 2. Define Standard Business Metrics

Examples:

  • orders_created_total
  • failed_logins_total
  • payment_latency_seconds

Business KPIs help identify anomalies beyond pure infrastructure.


βœ”οΈ 3. Trace End-to-End Requests

Use OpenTelemetry to instrument services.
Traces give visibility across microservices.


βœ”οΈ 4. Set SLOs and Alert Policies

For example:

  • SLO: 99% of HTTP requests < 150ms
  • Alert: More than 5% errors in 5 minutes

Good alerts are meaningful, not noisy.
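The error-rate alert above can be expressed as a Prometheus alerting rule. A sketch, reusing the `http_requests_total` metric from later in this article (group name, labels, and annotations are illustrative):

```yaml
groups:
  - name: orders-api-alerts
    rules:
      - alert: HighErrorRate
        # Fraction of 5xx responses over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of requests are failing"
```

The `for: 5m` clause keeps the alert from firing on a brief spike, which is one practical way to keep alerts meaningful rather than noisy.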


βœ”οΈ 5. Centralize All Telemetry Data

Use one platform for metrics + logs + traces.
Examples:

  • Datadog
  • New Relic
  • Grafana Cloud
  • AWS CloudWatch

πŸ› οΈ Real-World Example: Observability With Prometheus + Grafana + ELK Stack

Below is a complete example using a simple Node.js API instrumented with Prometheus metrics and ELK logging.


πŸ“Œ Example Application (Node.js)

We will expose:

  • /metrics endpoint for Prometheus scraping
  • Structured logs sent to Logstash (ELK stack)
  • A simple API endpoint

πŸ”§ Step 1 β€” Install Dependencies

npm install express prom-client winston winston-elasticsearch

πŸ§ͺ Step 2 β€” Node.js Code With Observability

// app.js
const express = require("express");
const client = require("prom-client");
const winston = require("winston");
const { ElasticsearchTransport } = require("winston-elasticsearch");

const app = express();
const register = new client.Registry();

// ----- PROMETHEUS METRICS -----
const httpRequestCounter = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"]
});
register.registerMetric(httpRequestCounter);

const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency",
  buckets: [0.1, 0.3, 0.5, 1, 2] 
});
register.registerMetric(httpRequestDuration);

client.collectDefaultMetrics({ register });

// ----- ELK LOGGING -----
const esTransportOpts = {
  level: "info",
  clientOpts: { node: "http://localhost:9200" }
};

const logger = winston.createLogger({
  transports: [
    // Console fallback so logs stay visible even if Elasticsearch is down
    new winston.transports.Console(),
    new ElasticsearchTransport(esTransportOpts)
  ]
});

// ----- NORMAL API ROUTE -----
app.get("/api/orders", async (req, res) => {
  const end = httpRequestDuration.startTimer();

  logger.info({
    message: "Fetching orders",
    service: "orders-api",
    environment: "production",
    timestamp: new Date().toISOString()
  });

  res.json({ orderId: 123, status: "OK" });

  httpRequestCounter.inc({
    method: "GET",
    route: "/api/orders",
    status: 200
  });

  end();
});

// ----- METRICS ENDPOINT FOR PROMETHEUS -----
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});

// ----- START SERVER -----
app.listen(3000, () => console.log("API running on port 3000"));

πŸ“Š Step 3 β€” Prometheus Scrape Configuration

Add to prometheus.yml:

scrape_configs:
  - job_name: "orders-api"
    static_configs:
      - targets: ["localhost:3000"]

Prometheus now collects:

  • request count
  • request latency
  • default node metrics

πŸ“ˆ Step 4 β€” Grafana Dashboard

Once Prometheus is added as a data source, Grafana can query these metrics and you can plot:

  • Requests per second
  • Error rate
  • Latency percentiles (P95, P99)

Create a panel and use this query:

rate(http_requests_total[5m])

Another useful panel for latency:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

πŸ“ Step 5 β€” ELK Stack for Logs

Logs go to Elasticsearch, and you can search them with Kibana:

Example search query:

service: "orders-api" AND level: "info"

Example rendered log:

{
  "message": "Fetching orders",
  "service": "orders-api",
  "environment": "production",
  "timestamp": "2025-11-23T10:20:34.123Z"
}

πŸš€ Final Thoughts

Modern systems demand more than simple health checks: they require full observability to identify, diagnose, and prevent problems in production.

By applying:

  • structured logs
  • custom metrics
  • distributed tracing
  • centralized dashboards
  • SLO-driven alerting

you significantly increase reliability, and most importantly, you gain the ability to understand your system deeply.
