CESAR NIKOLAS CAMAC MELENDEZ

πŸ” Observability Practices: A Practical Guide With Real-World Examples

Modern software systems are more distributed, dynamic, and complex than ever. Microservices, serverless functions, containers, and event-driven architectures make traditional monitoring insufficient.
To understand what is actually happening inside your application, you need observability.

This article explains the foundational pillars of observability, best practices, common tools, and includes a hands-on real-world example using Grafana + Prometheus for metrics and ELK Stack for logs.


🚦 What Is Observability?

Observability is the ability to understand the internal state of a system based solely on the data it produces, such as metrics, logs, and traces.

Unlike classical monitoring, which answers β€œIs the system up?”, observability answers:

  • Why is the system slow?
  • Where is the latency coming from?
  • Which dependency failed?
  • What changed recently that caused errors?

πŸ“Š The Three Pillars of Observability

1. Metrics

Numeric measurements collected over time.
Examples:

  • CPU usage
  • Request throughput
  • Error rate
  • Database latency

Tools: Prometheus, Grafana, Datadog Metrics, AWS CloudWatch Metrics.


2. Logs

Immutable records of events happening in the system.
Examples:

  • Info logs
  • Warnings
  • Errors
  • Audit logs

Tools: ELK Stack, Datadog Logs, Azure Log Analytics, Loki.


3. Traces

A trace follows a single request across distributed components.
Examples:

  • Microservices call chain
  • Latency across services
  • Errors from downstream dependencies

Tools: Jaeger, Zipkin, OpenTelemetry, AWS X-Ray, New Relic.


🧱 Core Observability Practices

βœ”οΈ 1. Use Structured Logging

Instead of plain text logs, use JSON.

{
  "timestamp": "2025-11-23T10:15:20Z",
  "level": "ERROR",
  "service": "orders-api",
  "message": "Payment gateway timeout",
  "orderId": 10422,
  "durationMs": 3200
}

Structured logs allow better filtering and search on platforms like ELK, Datadog, and CloudWatch.
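A structured log line is just JSON written to stdout (or a transport). Here is a minimal sketch using only Node's standard library; the field names (`service`, `orderId`, etc.) mirror the example above and are illustrative, not a standard:

```javascript
// Minimal structured logger: emits one JSON object per line.
function logEvent(level, service, message, extra = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    service,
    message,
    ...extra // merge any context fields (orderId, durationMs, ...)
  };
  console.log(JSON.stringify(entry));
  return entry; // returned so callers/tests can inspect it
}

logEvent("ERROR", "orders-api", "Payment gateway timeout", {
  orderId: 10422,
  durationMs: 3200
});
```

In practice you would use a library like winston or pino, but the principle is the same: one machine-parseable object per event.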


βœ”οΈ 2. Define Standard Business Metrics

Examples:

  • orders_created_total
  • failed_logins_total
  • payment_latency_seconds

Business KPIs help identify anomalies beyond pure infrastructure.


βœ”οΈ 3. Trace End-to-End Requests

Use OpenTelemetry to instrument services.
Traces give visibility across microservices.


βœ”οΈ 4. Set SLOs and Alert Policies

For example:

  • SLO: 99% of HTTP requests < 150ms
  • Alert: More than 5% errors in 5 minutes

Good alerts are meaningful, not noisy.
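The error-rate alert above can be expressed as a Prometheus alerting rule. A sketch, reusing the `http_requests_total` metric from later in this article (group name, labels, and annotations are illustrative):

```yaml
groups:
  - name: orders-api-alerts
    rules:
      - alert: HighErrorRate
        # Fraction of 5xx responses over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of requests are failing"
```

The `for: 5m` clause keeps the alert from firing on a brief spike, which is one practical way to keep alerts meaningful rather than noisy.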


βœ”οΈ 5. Centralize All Telemetry Data

Use one platform for metrics + logs + traces.
Examples:

  • Datadog
  • New Relic
  • Grafana Cloud
  • AWS CloudWatch

πŸ› οΈ Real-World Example: Observability With Prometheus + Grafana + ELK Stack

Below is a complete example using a simple Node.js API instrumented with Prometheus metrics and ELK logging.


πŸ“Œ Example Application (Node.js)

We will expose:

  • /metrics endpoint for Prometheus scraping
  • Structured logs sent to Logstash (ELK stack)
  • A simple API endpoint

πŸ”§ Step 1 β€” Install Dependencies

npm install express prom-client winston winston-elasticsearch

πŸ§ͺ Step 2 β€” Node.js Code With Observability

// app.js
const express = require("express");
const client = require("prom-client");
const winston = require("winston");
const { ElasticsearchTransport } = require("winston-elasticsearch");

const app = express();
const register = new client.Registry();

// ----- PROMETHEUS METRICS -----
const httpRequestCounter = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"]
});
register.registerMetric(httpRequestCounter);

const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency",
  buckets: [0.1, 0.3, 0.5, 1, 2] 
});
register.registerMetric(httpRequestDuration);

client.collectDefaultMetrics({ register });

// ----- ELK LOGGING -----
const esTransportOpts = {
  level: "info",
  clientOpts: { node: "http://localhost:9200" }
};

const logger = winston.createLogger({
  transports: [
    // Console fallback so logs stay visible even if Elasticsearch is down
    new winston.transports.Console(),
    new ElasticsearchTransport(esTransportOpts)
  ]
});

// ----- NORMAL API ROUTE -----
app.get("/api/orders", async (req, res) => {
  const end = httpRequestDuration.startTimer();

  logger.info({
    message: "Fetching orders",
    service: "orders-api",
    environment: "production",
    timestamp: new Date().toISOString()
  });

  res.json({ orderId: 123, status: "OK" });

  httpRequestCounter.inc({
    method: "GET",
    route: "/api/orders",
    status: 200
  });

  end();
});

// ----- METRICS ENDPOINT FOR PROMETHEUS -----
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});

// ----- START SERVER -----
app.listen(3000, () => console.log("API running on port 3000"));

πŸ“Š Step 3 β€” Prometheus Scrape Configuration

Add to prometheus.yml:

scrape_configs:
  - job_name: "orders-api"
    static_configs:
      - targets: ["localhost:3000"]

Prometheus now collects:

  • request count
  • request latency
  • default node metrics

πŸ“ˆ Step 4 β€” Grafana Dashboard

Once Prometheus is added as a data source, Grafana can query these metrics and you can plot:

  • Requests per second
  • Error rate
  • Latency percentiles (P95, P99)

Create a panel and use this query:

rate(http_requests_total[5m])

Another useful panel for latency:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

πŸ“ Step 5 β€” ELK Stack for Logs

Logs go to Elasticsearch, and you can search them with Kibana:

Example search query:

service: "orders-api" AND level: "info"

Example rendered log:

{
  "message": "Fetching orders",
  "service": "orders-api",
  "environment": "production",
  "timestamp": "2025-11-23T10:20:34.123Z"
}

πŸš€ Final Thoughts

Modern systems demand more than simple health checks: they require full observability to identify, diagnose, and prevent problems in production.

By applying:

  • structured logs
  • custom metrics
  • distributed tracing
  • centralized dashboards
  • SLO-driven alerting

you significantly increase reliability, and most importantly, you gain the ability to understand your system deeply.
