Rizwan Saleem

Building a Practical Observability Foundation for a Legacy Web Service

If you’re maintaining a legacy web service, you’ve likely inherited a tangled mix of monolithic code, ad-hoc dashboards, and fragile deployments. The result: slow incident response, brittle releases, and limited visibility into how users actually experience the system. This tutorial walks you through establishing a pragmatic, low-friction observability foundation that improves reliability, accelerates debugging, and scales with your team.

Outline

  • Define goals and constraints for observability
  • Instrumentation strategy: what to measure and how
  • Centralized logging and structured events
  • Tracing across a legacy stack
  • Metrics, dashboards, and alerting
  • Incident response workflow and runbooks
  • Practical code examples across a typical stack
  • Gradual rollout plan to avoid destabilizing the system
  • Optional: lightweight SRE practices for small teams

1) Define goals and constraints
Before touching code, agree on what “observability” must deliver for your context.

  • Core goals
    • Shorter mean time to detect and mean time to repair (MTTD/MTTR)
    • Clear visibility into end-to-end user flows
    • Confidence in releases with minimal regressions
    • Ability to triage incidents without depending on a single log file
  • Constraints
    • Limited engineering time and risk tolerance
    • Legacy tech stack (e.g., old web server, on-prem logs, little or no tracing of asynchronous work)
    • Compliance considerations (PII handling, data retention)
  • Success metrics you can track
    • Incident response time (time from alert to remediation)
    • Error rate by endpoint
    • P95/P99 latency per critical user path
    • Log completeness (key events present per request)
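
Log completeness is the least familiar of these metrics, so here is a minimal sketch of how it could be measured, assuming events have already been parsed into objects. The function name logCompleteness and the field names correlation_id and event_type are illustrative choices, not part of any library.

```javascript
// Hypothetical helper: "log completeness" as the share of requests whose
// logs contain every expected event type.
function logCompleteness(events, requiredTypes = ['request_start', 'request_end']) {
  const seen = new Map(); // correlation_id -> Set of event types observed
  for (const e of events) {
    if (!e.correlation_id) continue; // events without an id cannot be attributed
    if (!seen.has(e.correlation_id)) seen.set(e.correlation_id, new Set());
    seen.get(e.correlation_id).add(e.event_type);
  }
  let complete = 0;
  for (const types of seen.values()) {
    if (requiredTypes.every((t) => types.has(t))) complete += 1;
  }
  // With no requests observed there is nothing missing, so report 1.
  return seen.size === 0 ? 1 : complete / seen.size;
}

const sample = [
  { correlation_id: 'a1', event_type: 'request_start' },
  { correlation_id: 'a1', event_type: 'request_end' },
  { correlation_id: 'b2', event_type: 'request_start' }, // request_end is missing
];
console.log(logCompleteness(sample)); // 0.5
```

A value that drifts below 1 usually points at a code path that returns or crashes before the closing event is logged.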

2) Instrumentation strategy
A practical approach blends three layers: lightweight telemetry, structured events, and selective tracing.

  • Lightweight telemetry (non-intrusive)
    • Add minimal timing around critical code paths (start-end timestamps)
    • Capture request metadata: endpoint, method, user id (if available), and route
    • Record simple counters for outcomes: success, client error, server error
  • Structured events
    • Emit JSON log lines (one record per line) with a stable schema:
      • timestamp, service, environment, host
      • event_type (request_start, request_end, db_query, cache_hit, cache_miss, error)
      • attributes: path, duration_ms, status_code, error_code, user_id, correlation_id
    • Add a correlation_id to every request so its events can be stitched together
  • Selective tracing
    • For critical user journeys (e.g., checkout, sign-in), enable lightweight tracing with a correlation_id
    • If you can’t trace end-to-end, at least trace orchestration points (service A -> service B -> database)
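
As a rough sketch of the first two layers, the snippet below builds events with the fields listed above and keeps simple outcome counters. The helper names (makeEvent, recordOutcome), the set of allowed event types, the SERVICE_NAME environment variable, and the 'legacy-api' default are all assumptions made for illustration.

```javascript
const os = require('os');

// Allowed event types for this sketch (an assumption; adjust to your schema).
const EVENT_TYPES = new Set([
  'request_start', 'request_end', 'db_query', 'cache_hit', 'cache_miss', 'error',
]);

// Build one event with the stable top-level fields plus free-form attributes.
function makeEvent(eventType, attributes = {}, now = new Date()) {
  if (!EVENT_TYPES.has(eventType)) {
    throw new Error(`Unknown event_type: ${eventType}`);
  }
  return {
    timestamp: now.toISOString(),
    service: process.env.SERVICE_NAME || 'legacy-api', // hypothetical default
    environment: process.env.NODE_ENV || 'development',
    host: os.hostname(),
    event_type: eventType,
    attributes,
  };
}

// Simple in-process counters for request outcomes, keyed by status class.
const outcomes = { success: 0, client_error: 0, server_error: 0 };
function recordOutcome(statusCode) {
  if (statusCode >= 500) outcomes.server_error += 1;
  else if (statusCode >= 400) outcomes.client_error += 1;
  else outcomes.success += 1;
}

const event = makeEvent('request_end', {
  path: '/checkout', duration_ms: 42, status_code: 200, correlation_id: 'a1b2c3',
});
recordOutcome(200);
console.log(JSON.stringify(event)); // one JSON object per log line
```

Rejecting unknown event types at the source is a design choice: it keeps the schema from drifting as more people add log calls.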

3) Centralized logging and structured events

  • Choose a log transport strategy
    • Ship logs to a central log sink (e.g., ELK/EFK, Loki, or cloud-native options)
    • Prefer structured logs (JSON lines) over plain text for easier querying
  • Normalize log schemas
    • Use a consistent log field set across services:
      • timestamp, service, environment, host, level
      • correlation_id, trace_id (if available)
      • message, endpoint/path, status_code, duration_ms, error
  • Example: emitting a structured log line in Node.js

    • Install a simple logger (pino is popular for performance and JSON output)
    • Code snippet (Express-style middleware):

    const crypto = require('crypto');
    const pino = require('pino');
    const logger = pino({ level: process.env.LOG_LEVEL || 'info' });

    // Fallback id for requests that arrive without an X-Correlation-Id header
    function generateId() {
      return crypto.randomBytes(8).toString('hex');
    }

    function handleRequest(req, res, next) {
      const correlationId = req.headers['x-correlation-id'] || generateId();
      res.setHeader('X-Correlation-Id', correlationId);
      const start = Date.now();

      // Attach context for downstream logs
      req.context = { correlationId, start };

      logger.info({ event: 'request_start', path: req.path, method: req.method, correlationId });

      // 'finish' fires after the response has been fully sent
      res.on('finish', () => {
        logger.info({
          event: 'request_end',
          path: req.path,
          method: req.method,
          statusCode: res.statusCode,
          duration_ms: Date.now() - start,
          correlationId
        });
      });

      next();
    }

  • Log retention and privacy

    • Rotate and archive logs regularly
    • Redact or omit PII where possible; never log passwords or tokens
    • Implement a data retention policy aligned with compliance
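
To make the redaction point concrete, here is a small sketch that scrubs sensitive keys from an object before it is logged. The key list and the redact function are hypothetical and would need to match your own data model. pino also ships a built-in redact option that may be a better fit if you already use it.

```javascript
// Hypothetical list of sensitive keys; extend it to match your own payloads.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'cookie', 'email']);

// Return a deep copy of `value` with sensitive fields replaced by a marker.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

const safe = redact({
  event: 'login_attempt',
  user: { id: 42, email: 'a@example.com', password: 'hunter2' },
  headers: { Authorization: 'Bearer abc' },
});
console.log(JSON.stringify(safe));
```

Matching on key names only catches PII stored under known fields. Free-text values such as error messages can still leak data and need separate review.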

4) Tracing across a legacy stack

  • When full distributed tracing isn’t feasible, implement pragmatic trace contexts
    • Use a simple trace_id passed via headers, or reuse the correlation_id for this purpose
    • Propagate trace_id across service boundaries (HTTP, message queues, DB calls where possible)
  • Map end-to-end flows
    • Identify critical user journeys and create lightweight traces for them
    • Instrument at key boundaries:
      • API gateway / ingress
      • Authentication service
      • Core business service(s)
      • Database/query layer
  • Visualize traces
    • If you lack a distributed tracing system, at minimum build a dashboard that displays time spent in each component for a given trace_id
    • Example: show total journey time, latency per service, error rates per step
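
The per-trace view described above can be approximated from logs alone. The sketch below assumes each timed event carries trace_id, service, start_ms (epoch milliseconds), duration_ms, and an optional error field. Those field names and the summarizeTrace helper are illustrative assumptions.

```javascript
// Summarize one trace: total journey time plus time, calls, and errors per service.
function summarizeTrace(events, traceId) {
  const perService = {};
  let firstStart = Infinity;
  let lastEnd = -Infinity;
  for (const e of events) {
    if (e.trace_id !== traceId) continue;
    const entry = perService[e.service] || { total_ms: 0, calls: 0, errors: 0 };
    entry.total_ms += e.duration_ms;
    entry.calls += 1;
    if (e.error) entry.errors += 1;
    perService[e.service] = entry;
    firstStart = Math.min(firstStart, e.start_ms);
    lastEnd = Math.max(lastEnd, e.start_ms + e.duration_ms);
  }
  // Journey time is wall-clock span, not the sum of (possibly nested) durations.
  const journey_ms = Number.isFinite(firstStart) ? lastEnd - firstStart : 0;
  return { journey_ms, perService };
}

const traceEvents = [
  { trace_id: 't1', service: 'gateway', start_ms: 1000, duration_ms: 120 },
  { trace_id: 't1', service: 'auth', start_ms: 1005, duration_ms: 20 },
  { trace_id: 't1', service: 'api', start_ms: 1030, duration_ms: 85 },
  { trace_id: 't1', service: 'db', start_ms: 1040, duration_ms: 35 },
  { trace_id: 't1', service: 'db', start_ms: 1080, duration_ms: 25, error: 'timeout' },
  { trace_id: 't2', service: 'gateway', start_ms: 2000, duration_ms: 10 },
];
console.log(summarizeTrace(traceEvents, 't1'));
```

Because callers include the time of the services they call, per-service totals overlap and should not be added up to get the journey time.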

5) Metrics, dashboards, and alerting

  • Core metrics to collect
    • Requests per second (RPS) per endpoint
    • Error rate per endpoint (share of requests returning 5xx, with 4xx tracked separately)
    • Latency percentiles (p50, p95, p99) per critical path
    • Database query durations and queue times
    • Cache hit/miss rates
  • Dashboards
    • Incident-ready view: one page showing current health (uptime proxy, error rate spike, latency spike)
    • Journey-focused dashboards for critical flows
    • Operational dashboards for infra (CPU, memory, disk, network)
  • Alerts
    • Simple, non-noisy alerts are essential
    • Examples:
      • 5xx error rate > threshold for more than X minutes
      • P95 latency for critical path > threshold
      • correlation_id missing on calls between services (indicating an instrumentation gap)
    • Use multiple alert tiers (critical for outages, warning for degradations)
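
Two of the calculations above are easy to get subtly wrong, so here is a minimal sketch of each: a nearest-rank percentile and a "sustained for X minutes" error-rate check. The function names and the per-minute window shape ({ requests, errors_5xx }) are assumptions. A real metrics backend would normally do this for you.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Share of requests in one window that returned a 5xx status.
function errorRate(w) {
  return w.requests === 0 ? 0 : w.errors_5xx / w.requests;
}

// Fire only when every one of the last `sustainedMinutes` windows breaches the threshold.
function shouldAlert(minuteWindows, threshold, sustainedMinutes) {
  if (minuteWindows.length < sustainedMinutes) return false;
  return minuteWindows.slice(-sustainedMinutes).every((w) => errorRate(w) > threshold);
}

const latenciesMs = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100];
console.log(percentile(latenciesMs, 50)); // 50
console.log(percentile(latenciesMs, 95)); // 100

const windows = [
  { requests: 100, errors_5xx: 1 },
  { requests: 100, errors_5xx: 6 },
  { requests: 100, errors_5xx: 8 },
  { requests: 100, errors_5xx: 7 },
];
console.log(shouldAlert(windows, 0.05, 3)); // true
console.log(shouldAlert(windows, 0.05, 4)); // false
```

Requiring several consecutive breaching windows is what keeps a single bad minute from paging anyone.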

6) Incident response workflow

  • Create a lightweight runbook
    • When an alert fires:
      • Check for recent deployments or config changes
      • Filter logs by correlation_id to reconstruct the request path
      • Identify the first failing component and form a root cause hypothesis
      • Validate with a quick manual test or synthetic transaction if feasible
  • Collaboration and ownership
    • Assign on-call roles and a clear escalation path
    • Ensure runbooks are stored in a shareable, version-controlled location
  • Post-incident
    • Post-mortem with scope, cause, impact, remediation, and prevention plan
    • Update dashboards and instrumentation gaps as needed
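
The "filter logs by correlation_id" step from the runbook can be sketched as a small script over JSON-lines output. It assumes each line has timestamp (ISO 8601), level (as a string), and correlationId fields. Those names and the reconstructRequest helper are illustrative, and pino in particular logs numeric levels unless configured otherwise.

```javascript
// Rebuild one request's timeline from JSON-lines logs and find its first failure.
function reconstructRequest(jsonLines, correlationId) {
  const events = [];
  for (const line of jsonLines.split('\n')) {
    if (!line.trim()) continue;
    try {
      const e = JSON.parse(line);
      if (e.correlationId === correlationId) events.push(e);
    } catch (err) {
      // Skip lines that are not valid JSON (for example, legacy plain-text logs).
    }
  }
  events.sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
  const firstFailure =
    events.find((e) => e.level === 'error' || e.statusCode >= 500) || null;
  return { events, firstFailure };
}

const logText = [
  '{"timestamp":"2024-01-01T10:00:00.000Z","level":"info","event":"request_start","correlationId":"abc"}',
  '{"timestamp":"2024-01-01T10:00:00.050Z","level":"error","event":"db_error","correlationId":"abc","error":"timeout"}',
  'plain text line from an old logger',
  '{"timestamp":"2024-01-01T10:00:00.020Z","level":"info","event":"cache_miss","correlationId":"abc"}',
  '{"timestamp":"2024-01-01T10:00:00.010Z","level":"info","event":"request_start","correlationId":"xyz"}',
  '{"timestamp":"2024-01-01T10:00:00.060Z","level":"info","event":"request_end","statusCode":500,"correlationId":"abc"}',
].join('\n');

const timeline = reconstructRequest(logText, 'abc');
console.log(timeline.events.map((e) => e.event)); // request_start, cache_miss, db_error, request_end
console.log(timeline.firstFailure.event); // db_error
```

In practice the same filter is usually a saved query in your log store; a script like this is mainly useful when only raw files are available.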

7) Practical code examples across a typical stack
Assume a small three-tier legacy stack: a Node.js API server, a PostgreSQL database, and a Redis cache. We’ll add small instrumentation pieces.

  • Node.js API server: add structured logs and correlation

    • Install packages: express, pino, pg, redis
    • Example middleware to inject a correlation id and log requests:

    const crypto = require('crypto');
    const express = require('express');
    const pino = require('pino');
    const { Pool } = require('pg');

    const logger = pino({ level: process.env.LOG_LEVEL || 'info' });
    const pgPool = new Pool(); // reads connection settings from the PG* environment variables
    const app = express();

    function withCorrelationId(req, res, next) {
      const correlationId = req.headers['x-correlation-id'] || crypto.randomBytes(8).toString('hex');
      req.correlationId = correlationId;
      res.setHeader('X-Correlation-Id', correlationId);
      next();
    }

    // Time a query and log the outcome; the correlation id is passed in explicitly
    async function dbQuery(sql, params, client, correlationId) {
      const start = Date.now();
      try {
        const result = await client.query(sql, params);
        logger.info({ event: 'db_query', sql, duration_ms: Date.now() - start, correlationId });
        return result;
      } catch (err) {
        logger.error({ event: 'db_error', sql, error: err.message, duration_ms: Date.now() - start, correlationId });
        throw err;
      }
    }

    // Usage in route
    app.use(withCorrelationId);

    app.get('/users/:id', async (req, res) => {
      const { correlationId } = req;
      logger.info({ event: 'request_start', path: req.path, method: req.method, correlationId });
      let client;
      try {
        // Connect inside the try block so a pool failure also returns a 500
        client = await pgPool.connect();
        const userRes = await dbQuery('SELECT id, name FROM users WHERE id=$1', [req.params.id], client, correlationId);
        res.json(userRes.rows);
      } catch (e) {
        logger.error({ event: 'request_error', path: req.path, error: e.message, correlationId });
        res.status(500).json({ error: 'Internal Server Error' });
      } finally {
        if (client) client.release();
        logger.info({ event: 'request_end', path: req.path, statusCode: res.statusCode, correlationId });
      }
    });

  • PostgreSQL monitoring

    • Enable an application-side query log by wrapping queries and emitting a log line with duration
    • Use pg_stat_statements (if available) for long-term query insights
  • Redis cache instrumentation

    • Wrap get/set with timing and cache hit/miss logs:

    // currentCorrelation() is assumed to return the active request's correlation id;
    // it is not defined in this article
    async function cacheGet(key) {
      const start = Date.now();
      const value = await redisClient.get(key);
      const duration = Date.now() - start;
      // A missing key comes back as null; an empty string is still a hit
      const event = value != null ? 'cache_hit' : 'cache_miss';
      logger.info({ event, key, duration_ms: duration, correlationId: currentCorrelation() });
      return value;
    }
  • Optional: lightweight tracing using correlation_id

    • If you want tracing across services, propagate the X-Correlation-Id header and include it in logs:

    logger.info({ event: 'service_call', service: 'auth', correlationId: req.correlationId, duration_ms: elapsed });
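
The cache wrapper above relies on a currentCorrelation() helper that the article does not define. One possible way to provide it, sketched here under the assumption of an Express-style middleware signature, is Node's built-in AsyncLocalStorage, which carries a value through asynchronous calls without passing it as an argument. The names withRequestContext and outboundHeaders are hypothetical.

```javascript
const { AsyncLocalStorage } = require('async_hooks');
const crypto = require('crypto');

const requestContext = new AsyncLocalStorage();

// Express-style middleware: everything triggered by next() runs inside the context.
function withRequestContext(req, res, next) {
  const correlationId =
    req.headers['x-correlation-id'] || crypto.randomBytes(8).toString('hex');
  res.setHeader('X-Correlation-Id', correlationId);
  requestContext.run({ correlationId }, next);
}

// Returns the active request's correlation id, or undefined outside a request.
function currentCorrelation() {
  const store = requestContext.getStore();
  return store ? store.correlationId : undefined;
}

// Headers for calls to downstream services, so the id crosses the boundary.
function outboundHeaders(extra = {}) {
  const id = currentCorrelation();
  return id ? { ...extra, 'X-Correlation-Id': id } : { ...extra };
}

// Simulated request, using plain objects in place of real req/res.
const fakeReq = { headers: { 'x-correlation-id': 'abc123' } };
const fakeRes = { setHeader() {} };
withRequestContext(fakeReq, fakeRes, () => {
  console.log(outboundHeaders({ Accept: 'application/json' }));
});
```

Passing the id explicitly, as the database wrapper does, is simpler to reason about. Context storage pays off once many helpers need the id and threading it through every signature becomes noisy.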

8) Gradual rollout plan

  • Start small and non-disruptive
    • Add correlation_id propagation and structured logging to one service first
    • Build a local dashboard for that service and confirm logs reach the central store
  • Expand instrumentation incrementally
    • Instrument one critical path end-to-end
    • Introduce basic metrics (latency, error rate) and a simple alert
  • Reuse existing tools
    • If you have a SIEM or log aggregator, leverage it before introducing new tooling
  • Trim noise gradually
    • As you gain confidence, prune verbose logs and avoid excessive logging in hot paths
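
One way to cut log volume in hot paths without losing whole requests is to sample by correlation id, so that a request is either logged in detail or not at all. The sketch below is an assumption about how that could look. shouldLogVerbose and the 10% rate are illustrative, and errors and request_end events would still be logged unconditionally.

```javascript
const crypto = require('crypto');

// Deterministic sampling: hash the correlation id into [0, 1) and compare to the rate.
// Every service that sees the same id makes the same keep/drop decision.
function shouldLogVerbose(correlationId, sampleRate) {
  const hash = crypto.createHash('sha256').update(correlationId).digest();
  const bucket = hash.readUInt32BE(0) / 0x100000000; // divide by 2^32
  return bucket < sampleRate;
}

// Usage sketch: emit detailed events only for sampled requests.
const SAMPLE_RATE = 0.1; // hypothetical: keep verbose logs for about 10% of requests
const id = 'a1b2c3d4';
if (shouldLogVerbose(id, SAMPLE_RATE)) {
  console.log(JSON.stringify({ event: 'db_query', correlationId: id, duration_ms: 3 }));
}
```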

9) Optional: lightweight SRE practices for small teams

  • Define a small service level objective (SLO)
    • Example: 99th percentile latency under 800 ms for the checkout path
  • Maintain error budgets
    • If error budget is exhausted, limit changes other than critical fixes
  • Keep runbooks living documents
    • Store them in version control; rehearse on a quarterly basis
  • Automate only what is necessary
    • Automate deployment checks that verify instrumentation is alive after a release
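
The error budget arithmetic is simple enough to sketch. Here the latency SLO above is read as "at least 99% of checkout requests finish under 800 ms", so the budget is the remaining 1% of requests. The errorBudget helper and the request counts are made up for illustration.

```javascript
// Error budget for a request-based SLO.
// sloPercent: target share of "good" requests, e.g. 99 for 99%.
function errorBudget(totalRequests, badRequests, sloPercent) {
  // Rounded to whole requests to avoid floating-point noise.
  const allowedBad = Math.round((totalRequests * (100 - sloPercent)) / 100);
  const remaining = allowedBad - badRequests;
  return {
    allowedBad,
    remaining,
    consumedFraction: allowedBad === 0 ? null : badRequests / allowedBad,
    exhausted: remaining < 0,
  };
}

// Hypothetical month: 200,000 checkout requests, 1,500 slower than 800 ms.
const budget = errorBudget(200000, 1500, 99);
console.log(budget);
// { allowedBad: 2000, remaining: 500, consumedFraction: 0.75, exhausted: false }
```

With 75% of the budget consumed, a small team might reasonably slow down risky releases before the budget is fully spent rather than after.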

Illustrative example: end-to-end workflow

  • A user visits /checkout
  • The API service handles the request, logs request_start
  • It queries PostgreSQL for inventory and creates an order
  • It caches some results in Redis
  • The response includes a correlation_id for downstream tracing
  • If an error occurs, the error is logged with the correlation_id, enabling rapid triage

If you’d like, I can tailor this plan to your exact tech stack (language, framework, hosting, and logging stack) and provide a concrete rollout schedule with checklist items and sample dashboards. Would you share details about your current stack, logging system, and any existing monitoring you’re already using?

Rizwan Saleem | https://rizwansaleem.co
