DEV Community

Matthias Bruns

Posted on • Originally published at appetizers.io

Observability in Production: Metrics, Traces, and Logs That Actually Matter

Production systems fail. That's not pessimism—it's reality. The question isn't whether your cloud-native applications will encounter issues, but whether you'll be able to diagnose and fix them before they impact users. This is where observability becomes critical, moving beyond simple monitoring to provide deep insights into system behavior.

Observability in software systems is fundamentally about understanding internal state through external outputs. Unlike traditional monitoring that tells you what happened, observability helps you understand why it happened. For teams running microservices on Kubernetes, this distinction can mean the difference between a five-minute fix and a three-hour war room session.

The Three Pillars: More Than Marketing Buzzwords

The industry has standardized on three pillars of observability: metrics, logs, and traces. But as the Kubernetes documentation notes, these aren't just categories—they're complementary data sources that together provide a complete picture of system health.

Metrics give you the quantitative data: response times, error rates, resource utilization. They're your system's vital signs.

Logs provide the qualitative context: what happened, when it happened, and often why it happened. They're your system's diary.

Traces show you the journey: how a request flows through your distributed system, where it slows down, and where it fails. They're your system's GPS.

Here's what makes this powerful: each pillar compensates for the others' weaknesses. Metrics are efficient but lack context. Logs provide context but can be overwhelming. Traces show relationships but generate massive data volumes.

Metrics That Actually Drive Decisions

Most teams collect too many metrics and act on too few. The key is identifying metrics that directly correlate with user experience and business outcomes.

The Four Golden Signals

Start with Google's Four Golden Signals, adapted for your specific context:

  • Latency: How long requests take to complete
  • Traffic: How many requests you're handling
  • Errors: How many requests are failing
  • Saturation: How close to capacity your resources are

For a Kubernetes-based e-commerce API, this might look like:

# Prometheus recording rules example
groups:
- name: ecommerce_sli
  rules:
  - record: http_request_duration_seconds:rate5m
    expr: rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

  - record: http_requests:rate5m
    expr: sum(rate(http_requests_total[5m])) by (service, method)

  - record: http_requests_errors:rate5m
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

  - record: pod_cpu_utilization
    # quota is CPU-microseconds per period, so divide by the period to get cores
    expr: rate(container_cpu_usage_seconds_total[5m]) / (container_spec_cpu_quota / container_spec_cpu_period) * 100

Business-Specific Metrics

Beyond infrastructure metrics, track what matters to your business:

  • Conversion funnel metrics: Cart additions, checkout completions, payment successes
  • Feature adoption: New feature usage rates, user engagement depth
  • Revenue impact: Transaction volumes, average order values, failed payment rates

The goal is creating a direct line from technical metrics to business impact. When CPU utilization spikes, you should immediately know whether it's affecting checkout completion rates.
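As a sketch of that linkage, here is a library-agnostic, plain-Node version of deriving a business SLI from raw counters. In a real service these counters would live in a metrics library such as prom-client, and the event names (checkout_started, checkout_completed) are illustrative, not prescribed:

```javascript
// Minimal in-memory sketch of deriving a business metric from raw counters.
const counters = new Map();

function increment(name) {
  counters.set(name, (counters.get(name) || 0) + 1);
}

// Business SLI derived from technical counters: checkout completion rate.
// A drop here is what turns a CPU spike into an actionable incident.
function checkoutCompletionRate() {
  const started = counters.get('checkout_started') || 0;
  const completed = counters.get('checkout_completed') || 0;
  return started === 0 ? 1 : completed / started;
}

// Simulate traffic: 10 checkouts started, 9 completed.
for (let i = 0; i < 10; i++) increment('checkout_started');
for (let i = 0; i < 9; i++) increment('checkout_completed');

console.log(checkoutCompletionRate()); // 0.9
```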

Structured Logging: Your Debug Lifeline

According to Stack Overflow's analysis, developer productivity increases significantly when engineers can jump directly to the root cause instead of hunting across multiple systems. Structured logging is fundamental to this capability.

JSON All the Things

Structured logs in JSON format enable powerful querying and correlation:

// Bad: Unstructured logging
console.log("User john_doe failed to complete checkout for order 12345");

// Good: Structured logging
logger.info({
  event: "checkout_failed",
  user_id: "john_doe",
  order_id: "12345",
  error_code: "PAYMENT_DECLINED",
  payment_method: "credit_card",
  cart_value: 89.99,
  session_id: "sess_abc123",
  trace_id: "trace_xyz789"
});

Context Propagation

Every log entry should include enough context to understand the request flow:

const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    // printf does the JSON serialization itself, so format.json() is not needed
    winston.format.printf(({ timestamp, level, message, ...meta }) => {
      return JSON.stringify({
        timestamp,
        level,
        message,
        service: process.env.SERVICE_NAME,
        version: process.env.SERVICE_VERSION,
        trace_id: meta.trace_id,
        span_id: meta.span_id,
        user_id: meta.user_id,
        ...meta
      });
    })
  )
});

Log Levels That Make Sense

Use log levels strategically:

  • ERROR: Something broke that requires immediate attention
  • WARN: Something unexpected happened but the system recovered
  • INFO: Normal business operations (user actions, external API calls)
  • DEBUG: Detailed information for troubleshooting (disabled in production)
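The gating behind those levels is simple enough to sketch in plain JavaScript; most logging libraries (winston, pino) work essentially this way, with the threshold usually read from an environment variable such as LOG_LEVEL:

```javascript
// Sketch of level-gated logging: each level has a verbosity rank, and
// anything more verbose than the configured threshold is dropped.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };

function createLogger(threshold) {
  const emitted = [];
  const log = (level, message) => {
    // DEBUG (rank 3) is silenced when the threshold is "info" (rank 2)
    if (LEVELS[level] <= LEVELS[threshold]) {
      emitted.push({ level, message });
    }
  };
  return { log, emitted };
}

// In production the threshold would come from config, e.g. LOG_LEVEL=info.
const logger = createLogger('info');
logger.log('error', 'payment gateway unreachable');
logger.log('debug', 'raw gateway response: ...');

console.log(logger.emitted.length); // 1
```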

Distributed Tracing: Following the Breadcrumbs

In microservices architectures, a single user request might touch dozens of services. Distributed tracing connects these interactions, showing you exactly where requests slow down or fail.

OpenTelemetry Implementation

OpenTelemetry has become the standard for instrumentation. Here's how to add tracing to a Node.js service:

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');

const sdk = new NodeSDK({
  traceExporter: new JaegerExporter({
    endpoint: process.env.JAEGER_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-http': {
      requestHook: (span, request) => {
        span.setAttributes({
          'user.id': request.headers['x-user-id'],
          'request.size': request.headers['content-length']
        });
      }
    }
  })]
});

sdk.start();

Custom Spans for Business Logic

Auto-instrumentation covers HTTP calls and database queries, but add custom spans for critical business operations:

const opentelemetry = require('@opentelemetry/api');

async function processPayment(orderId, paymentDetails) {
  const tracer = opentelemetry.trace.getTracer('payment-service');

  return tracer.startActiveSpan('process_payment', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'payment.method': paymentDetails.method,
      'payment.amount': paymentDetails.amount
    });

    try {
      const result = await paymentGateway.charge(paymentDetails);
      span.setAttributes({
        'payment.status': result.status,
        'payment.transaction_id': result.transactionId
      });
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: opentelemetry.SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}

Kubernetes-Specific Observability

Kubernetes observability requires understanding both application behavior and cluster health. The platform's dynamic nature—pods starting, stopping, and moving—adds complexity that traditional monitoring approaches can't handle.

Pod and Node Metrics

Monitor resource utilization and availability:

# Prometheus scrape config for Kubernetes metrics
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)

Service Mesh Integration

If you're using a service mesh like Istio, leverage its built-in observability:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-metrics
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: ALL_METRICS
      tagOverrides:
        request_id:
          operation: UPSERT
          value: request.id

Application Performance in Kubernetes Context

Traditional APM tools often miss Kubernetes-specific context. Ensure your observability stack correlates application performance with:

  • Pod restarts and scheduling events
  • Resource limits and requests
  • Network policies and service mesh configuration
  • ConfigMap and Secret changes

Building Effective Dashboards

Dashboards should tell a story, not just display data. Structure them around user journeys and system flows.

The Inverted Pyramid Approach

Start with high-level business metrics, then drill down:

  1. Business KPIs: Revenue, conversion rates, user satisfaction
  2. Service-level indicators: Request rates, error rates, latencies
  3. Infrastructure metrics: CPU, memory, network, storage
  4. Detailed diagnostics: Individual service performance, database queries

Alert Fatigue is Real

As Splunk notes, the goal is reducing time to resolution, not increasing alert volume. Design alerts that require action:

# Good: Actionable alert
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 2m
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }} for service {{ $labels.service }}"
    runbook_url: "https://wiki.company.com/runbooks/high-error-rate"

# Bad: Noisy alert
- alert: HighCPU
  expr: cpu_usage > 80
  for: 30s

Observability-Driven Development

Wikipedia describes observability-driven development as shipping features with custom instrumentation built in from the start. In practice this means thinking about observability during development, not after deployment.

Instrumentation as Code

Include observability requirements in your definition of done:

  • Every new API endpoint includes metrics, logging, and tracing
  • Business logic includes custom spans for critical operations
  • Error handling includes structured error logging
  • Feature flags include adoption and performance metrics
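The first item on that list can be enforced with a reusable wrapper rather than per-endpoint boilerplate. The sketch below is framework-agnostic plain JavaScript; the handler signature and the in-memory record sink are hypothetical stand-ins for a real metrics/tracing backend:

```javascript
// Wrap a request handler so every invocation emits an outcome record and a
// latency metric -- here collected into a plain array instead of a backend.
const records = [];

function instrument(name, handler) {
  return async (req) => {
    const start = process.hrtime.bigint();
    try {
      const result = await handler(req);
      records.push({ op: name, outcome: 'ok' });
      return result;
    } catch (err) {
      records.push({ op: name, outcome: 'error', error: err.message });
      throw err;
    } finally {
      // Latency is recorded whether the handler succeeded or threw.
      const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
      records.push({ op: name, metric: 'duration_ms', value: durationMs });
    }
  };
}

const getOrder = instrument('get_order', async (req) => ({ id: req.id }));

getOrder({ id: 'order-123' }).then(() => {
  console.log(records[0].outcome); // "ok"
});
```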

Testing Observability

Test your observability instrumentation like any other code:

describe('Payment Processing', () => {
  it('should create a trace span for payment operations', async () => {
    const mockTracer = new MockTracer();

    await processPayment('order-123', paymentDetails);

    const spans = mockTracer.report().spans;
    expect(spans).toHaveLength(1);
    expect(spans[0].operationName).toBe('process_payment');
    expect(spans[0].attributes['order.id']).toBe('order-123');
  });
});

The Economics of Observability

Observability isn't free. Data ingestion, storage, and processing costs can quickly spiral out of control. Budget for roughly 5-15% of your infrastructure costs for observability tooling.

Sampling Strategies

For high-traffic services, implement intelligent sampling:

const { SamplingDecision } = require('@opentelemetry/sdk-trace-base');

const sampler = {
  shouldSample: (context, traceId, spanName, spanKind, attributes) => {
    // Always sample errors
    if (attributes['http.status_code'] >= 400) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLE };
    }

    // Sample ~1% of successful requests, keyed on the trace ID so every
    // span in a trace gets the same decision
    if (parseInt(traceId.slice(-2), 16) < Math.floor(256 * 0.01)) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLE };
    }

    return { decision: SamplingDecision.NOT_RECORD };
  }
};

Data Retention Policies

Implement tiered retention:

  • High-resolution metrics: 7 days
  • Medium-resolution metrics: 30 days
  • Low-resolution metrics: 1 year
  • Traces: 7 days (with longer retention for errors)
  • Logs: 30 days (with longer retention for errors and security events)
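Encoding a policy like this as data keeps it testable and out of ad-hoc dashboard settings. A plain-JS sketch (the 3x multiplier for errors is an assumption, not a standard):

```javascript
// The tiered retention policy above, expressed as data (days per signal/tier).
const RETENTION_DAYS = {
  'metrics:high': 7,
  'metrics:medium': 30,
  'metrics:low': 365,
  'traces': 7,
  'logs': 30,
};

// Errors and security events get extended retention (assumed 3x multiplier).
function retentionDays(kind, { isError = false } = {}) {
  const base = RETENTION_DAYS[kind];
  if (base === undefined) throw new Error(`unknown signal kind: ${kind}`);
  return isError ? base * 3 : base;
}

console.log(retentionDays('metrics:high')); // 7
console.log(retentionDays('logs', { isError: true })); // 90
```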

Making Observability Actionable

The best observability setup is useless if it doesn't drive better outcomes. Microsoft's research on AI systems emphasizes that observability is foundational for operational control in production systems.

Runbooks and Automation

Every alert should link to a runbook that explains:

  • What the alert means
  • How to investigate the issue
  • Common causes and solutions
  • When to escalate

Better yet, automate responses where possible:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: auto-remediation
spec:
  entrypoint: high-error-rate-response
  templates:
  - name: high-error-rate-response
    steps:
    - - name: scale-up
        template: scale-deployment
        arguments:
          parameters:
          - name: deployment
            value: "{{workflow.parameters.deployment}}"
          - name: replicas
            # plain "{{ }}" tags cannot do arithmetic; an expression tag can
            value: "{{=asInt(workflow.parameters['current-replicas']) * 2}}"

Continuous Improvement

Use observability data to drive architectural decisions:

  • Identify services that would benefit from caching
  • Spot database queries that need optimization
  • Find microservices boundaries that create unnecessary latency
  • Discover features that aren't being used and can be deprecated

The Path Forward

Observability isn't a destination—it's a capability that evolves with your systems. Start with the basics: structured logging, key metrics, and distributed tracing for critical paths. Build dashboards that tell stories, not just display data. Create alerts that drive action, not noise.

Most importantly, make observability a team responsibility, not just an operations concern. When developers can quickly understand how their code behaves in production, everyone wins: faster debugging, fewer outages, and systems that actually scale with confidence.

The complexity of cloud-native systems isn't going away. But with proper observability, that complexity becomes manageable, debuggable, and ultimately, a competitive advantage.
