Building a Practical Observability Foundation for a Legacy Web Service
If you’re maintaining a legacy web service, you’ve likely inherited a tangled mix of monolithic code, ad-hoc dashboards, and fragile deployments. The result: slow incident response, brittle releases, and limited visibility into how users actually experience the system. This tutorial walks you through establishing a pragmatic, low-friction observability foundation that improves reliability, accelerates debugging, and scales with your team.
Outline
- Define goals and constraints for observability
- Instrumentation strategy: what to measure and how
- Centralized logging and structured events
- Tracing across a legacy stack
- Metrics, dashboards, and alerting
- Incident response workflow and runbooks
- Practical code examples across a typical stack
- Gradual rollout plan to avoid destabilizing the system
- Optional: lightweight SRE practices for small teams
1) Define goals and constraints
Before touching code, agree on what “observability” must deliver for your context.
- Core goals
- Lower mean time to detect (MTTD) and mean time to repair (MTTR)
- Clear visibility into end-to-end user flows
- Confidence in releases with minimal regressions
- Ability to triage incidents without depending on a single log file
- Constraints
- Limited engineering time and risk tolerance
- Legacy tech stack (e.g., old web server, on-prem logs, few/limited async traces)
- Compliance considerations (PII handling, data retention)
- Success metrics you can track
- Incident response time (time from alert to remediation)
- Error rate by endpoint
- P95/99 latency per critical user path
- Log completeness (key events present per request)
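To make the latency and error-rate metrics concrete, here is a minimal sketch of how they could be computed from a batch of request_end events. It assumes each event carries path, statusCode, and duration_ms fields (the names used in the logging examples later in this tutorial); the function names percentile and summarizeByPath are hypothetical, and the nearest-rank percentile method is one common choice among several.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Group request_end events by path and report request count,
// 5xx error rate, and p95/p99 latency for each path.
function summarizeByPath(events) {
  const byPath = new Map();
  for (const e of events) {
    if (!byPath.has(e.path)) {
      byPath.set(e.path, { durations: [], total: 0, serverErrors: 0 });
    }
    const stats = byPath.get(e.path);
    stats.durations.push(e.duration_ms);
    stats.total += 1;
    if (e.statusCode >= 500) stats.serverErrors += 1;
  }

  const summary = {};
  for (const [path, stats] of byPath) {
    summary[path] = {
      requests: stats.total,
      errorRate: stats.serverErrors / stats.total,
      p95_ms: percentile(stats.durations, 95),
      p99_ms: percentile(stats.durations, 99),
    };
  }
  return summary;
}
```

In practice a metrics backend would usually compute these for you; the sketch is mainly useful for checking that your logs contain enough information to derive the numbers at all.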
2) Instrumentation strategy
A practical approach blends three layers: lightweight telemetry, structured events, and selective tracing.
- Lightweight telemetry (non-intrusive)
- Add minimal timing around critical code paths (start-end timestamps)
- Capture request metadata: endpoint, method, user id (if available), and route
- Record simple counters for outcomes: success, client error, server error
- Structured events
- Emit JSON/log line records with a stable schema:
- timestamp, service, environment, host
- event_type (request_start, request_end, db_query, cache_hit, cache_miss, error)
- attributes: path, duration_ms, status_code, error_code, user_id, correlation_id
- Encourage adding a correlation_id per request to stitch traces
- Selective tracing
- For critical user journeys (e.g., checkout, sign-in), enable lightweight tracing with a correlation_id
- If you can’t trace end-to-end, at least trace orchestration points (service A -> service B -> database)
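The stable schema described above can be sketched as a small event builder. This is an illustration under assumptions, not a prescribed implementation: the names buildEvent, emit, and EVENT_TYPES are hypothetical, the SERVICE_NAME environment variable is an assumed convention, and the allowed event types mirror the ones used in this tutorial's examples.

```javascript
const os = require('node:os');

// Allowed event types (assumption: adjust to the events your service emits).
const EVENT_TYPES = new Set([
  'request_start', 'request_end', 'db_query', 'cache_hit', 'cache_miss', 'error',
]);

// Build one event with the fixed top-level fields and free-form attributes.
function buildEvent(eventType, attributes = {}, base = {}) {
  if (!EVENT_TYPES.has(eventType)) {
    throw new Error(`Unknown event_type: ${eventType}`);
  }
  return {
    timestamp: new Date().toISOString(),
    service: base.service || process.env.SERVICE_NAME || 'unknown-service',
    environment: base.environment || process.env.NODE_ENV || 'development',
    host: base.host || os.hostname(),
    event_type: eventType,
    attributes, // e.g. path, duration_ms, status_code, correlation_id
  };
}

// Write the event as a single JSON line; the writer is injectable for tests.
function emit(event, write = (line) => process.stdout.write(line + '\n')) {
  write(JSON.stringify(event));
}
```

Rejecting unknown event types early keeps the schema from drifting as more people add instrumentation.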
3) Centralized logging and structured events
- Choose a log transport strategy
- Ship logs to a central log sink (e.g., ELK/EFK, Loki, or cloud-native options)
- Prefer structured logs (JSON lines) over plain text for easier querying
- Normalize log schemas
- Use a consistent log field set across services:
- timestamp, service, environment, host, level
- correlation_id, trace_id (if available)
- message, endpoint/path, status_code, duration_ms, error
- Example: emitting a structured log line in Node.js
- Install a simple logger (pino is popular for performance and JSON output)
- Code snippet (Express-style middleware):

    const crypto = require('crypto');
    const pino = require('pino');
    const logger = pino({ level: process.env.LOG_LEVEL || 'info' });

    function handleRequest(req, res, next) {
      // Reuse the caller's correlation id, or generate one for this request
      const correlationId = req.headers['x-correlation-id'] || crypto.randomBytes(8).toString('hex');
      res.setHeader('X-Correlation-Id', correlationId);
      const start = Date.now();

      // Attach context for downstream logs
      req.context = { correlationId, start };
      logger.info({ event: 'request_start', path: req.path, method: req.method, correlationId });

      // Log the outcome once the response has been sent
      res.on('finish', () => {
        const duration = Date.now() - start;
        logger.info({
          event: 'request_end',
          path: req.path,
          method: req.method,
          statusCode: res.statusCode,
          duration_ms: duration,
          correlationId
        });
      });
      next();
    }

- Log retention and privacy
- Rotate and archive logs regularly
- Redact or omit PII where possible; never log passwords or tokens
- Implement a data retention policy aligned with compliance
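The redaction rule above could be enforced with a small helper that runs before anything is logged. This is a minimal sketch: the redact function and the SENSITIVE_KEYS list are assumptions for illustration, and the list should be extended to match the fields your service actually handles. pino also offers a built-in redact option, which may be a better fit if you already use it.

```javascript
// Keys whose values should never reach the logs (assumed list; extend as needed).
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'cookie', 'email']);

// Return a deep copy of the value with sensitive fields masked.
// Key matching is case-insensitive; arrays and nested objects are walked.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(inner);
    }
    return out;
  }
  return value;
}
```

A call such as logger.info(redact({ event: 'login', body: req.body })) would then mask the password field while leaving the rest of the payload searchable.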
4) Tracing across a legacy stack
- When full distributed tracing isn’t feasible, implement pragmatic trace contexts
- Use a simple trace_id passed via headers or correlation_id
- Propagate trace_id across service boundaries (HTTP, message queues, DB calls where possible)
- Map end-to-end flows
- Identify critical user journeys and create lightweight traces for them
- Instrument at key boundaries:
- API gateway / ingress
- Authentication service
- Core business service(s)
- Database/query layer
- Visualize traces
- If you lack a distributed tracing system, at minimum build a dashboard that displays time spent in each component for a given trace_id
- Example: show total journey time, latency per service, error rates per step
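As a rough illustration of the "time spent in each component" view, the sketch below summarizes timing records for one trace_id. It assumes each record has trace_id, component, start_ms, duration_ms, and an optional error flag; those field names and the summarizeTrace function are hypothetical, so map them to whatever your logs actually contain.

```javascript
// Summarize all timing records that belong to one trace.
// total_ms is wall-clock time from the earliest start to the latest end,
// so nested or overlapping calls are not double counted.
function summarizeTrace(spans, traceId) {
  const own = spans.filter((s) => s.trace_id === traceId);
  if (own.length === 0) return null;

  const firstStart = Math.min(...own.map((s) => s.start_ms));
  const lastEnd = Math.max(...own.map((s) => s.start_ms + s.duration_ms));

  const perComponent = {};
  for (const s of own) {
    if (!perComponent[s.component]) {
      perComponent[s.component] = { duration_ms: 0, calls: 0, errors: 0 };
    }
    const c = perComponent[s.component];
    c.duration_ms += s.duration_ms;
    c.calls += 1;
    if (s.error) c.errors += 1;
  }

  return { trace_id: traceId, total_ms: lastEnd - firstStart, perComponent };
}
```

Feeding the result into a table or bar chart per trace_id is often enough to spot which boundary dominates the journey time.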
5) Metrics, dashboards, and alerting
- Core metrics to collect
- Requests per second (RPS) per endpoint
- Error rate per endpoint (share of requests returning 5xx, with 4xx tracked separately)
- Latency percentiles (p50, p95, p99) per critical path
- Database query durations and queue times
- Cache hit/miss rates
- Dashboards
- Incident-ready view: one page showing current health (uptime proxy, error rate spike, latency spike)
- Journey-focused dashboards for critical flows
- Operational dashboards for infra (CPU, memory, disk, network)
- Alerts
- Simple, non-noisy alerts are essential
- Examples:
- 5xx error rate > threshold for more than X minutes
- P95 latency for critical path > threshold
- correlation_id not propagated between services (indicates an instrumentation gap)
- Use multiple alert tiers (critical for outages, warning for degradations)
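The "error rate above threshold for more than X minutes" rule and the tiering idea can be sketched as follows. This assumes you already aggregate counts into one-minute windows of the form { total, serverErrors }; the function names and threshold values are illustrative only, and most alerting tools express the same logic declaratively.

```javascript
// Fire only when every one of the last `sustainedMinutes` windows breaches
// the threshold, which filters out short spikes.
// `windows` is ordered oldest first, one entry per minute.
function shouldAlert(windows, threshold, sustainedMinutes) {
  if (windows.length < sustainedMinutes) return false;
  const recent = windows.slice(-sustainedMinutes);
  return recent.every((w) => w.total > 0 && w.serverErrors / w.total > threshold);
}

// Map an error rate onto alert tiers: critical for outages,
// warning for degradations, ok otherwise.
function classify(errorRate, thresholds) {
  if (errorRate > thresholds.critical) return 'critical';
  if (errorRate > thresholds.warning) return 'warning';
  return 'ok';
}
```

For example, shouldAlert(windows, 0.05, 3) would page only after three consecutive minutes above a 5% server error rate.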
6) Incident response workflow
- Create a lightweight runbook
- When alert fires, steps:
- Check for recent deployments or config changes
- Filter logs by correlation_id to reconstruct the request path
- Identify the first failing component and root cause hypothesis
- Validate with a quick manual test or synthetic transaction if feasible
- Collaboration and ownership
- Assign on-call roles and a clear escalation path
- Ensure runbooks are stored in a shareable, version-controlled location
- Post-incident
- Post-mortem with scope, cause, impact, remediation, and prevention plan
- Update dashboards and instrumentation gaps as needed
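The runbook step "filter logs by correlation_id" might look like the sketch below when you only have raw JSON-lines logs to work with. It assumes pino's default output, where time is epoch milliseconds and level is numeric (50 and above meaning error or fatal), plus the correlationId and statusCode fields from this tutorial's examples. The function names are hypothetical.

```javascript
// Collect every log entry for one request and order it by time.
// Lines that are not valid JSON (for example legacy plain-text logs) are skipped.
function reconstructRequest(logText, correlationId) {
  const events = [];
  for (const line of logText.split('\n')) {
    if (!line.trim()) continue;
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue;
    }
    if (entry.correlationId === correlationId) events.push(entry);
  }
  events.sort((a, b) => a.time - b.time);
  return events;
}

// Return the earliest entry that looks like a failure, or null if none does.
function firstFailure(events) {
  return events.find((e) => e.level >= 50 || e.statusCode >= 500) || null;
}
```

With a central log store you would normally run the equivalent query there; the point is that a single id should be enough to rebuild the timeline.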
7) Practical code examples across a typical stack
Assume a small three-tier legacy stack: a Node.js API server, a PostgreSQL database, and a Redis cache. We’ll add small instrumentation pieces.
- Node.js API server: add structured logs and correlation
- Install packages: pino, pg, redis (plus express, which is assumed here for app)
- Example middleware to inject correlation and log requests:

    const crypto = require('crypto');
    const express = require('express');
    const pino = require('pino');
    const { Pool } = require('pg');

    const logger = pino({ level: process.env.LOG_LEVEL || 'info' });
    const pgPool = new Pool(); // connection settings come from the PG* environment variables
    const app = express();

    function withCorrelationId(req, res, next) {
      const correlationId = req.headers['x-correlation-id'] || crypto.randomBytes(8).toString('hex');
      req.correlationId = correlationId;
      res.setHeader('X-Correlation-Id', correlationId);
      next();
    }

    // Run a query and log its duration; the correlation id is passed explicitly
    async function dbQuery(sql, params, client, correlationId) {
      const start = Date.now();
      try {
        const result = await client.query(sql, params);
        logger.info({ event: 'db_query', sql, duration_ms: Date.now() - start, correlationId });
        return result;
      } catch (err) {
        logger.error({ event: 'db_error', sql, error: err.message, duration_ms: Date.now() - start, correlationId });
        throw err;
      }
    }

    // Usage in route
    app.use(withCorrelationId);

    app.get('/users/:id', async (req, res) => {
      const { correlationId } = req;
      const start = Date.now();
      logger.info({ event: 'request_start', path: req.path, method: req.method, correlationId });
      let client;
      try {
        // Connect inside the try block so a pool failure also returns a 500
        client = await pgPool.connect();
        const userRes = await dbQuery('SELECT id, name FROM users WHERE id=$1', [req.params.id], client, correlationId);
        res.json(userRes.rows);
      } catch (e) {
        logger.error({ event: 'error', path: req.path, error: e.message, correlationId });
        res.status(500).json({ error: 'Internal Server Error' });
      } finally {
        if (client) client.release();
        logger.info({ event: 'request_end', path: req.path, statusCode: res.statusCode, duration_ms: Date.now() - start, correlationId });
      }
    });

- PostgreSQL monitoring
- Enable an application-side query log by wrapping queries and emitting a log line with duration
- Use pg_stat_statements (if available) for long-term query insights
- Redis cache instrumentation
- Wrap get/set with timing and cache hit/miss logs:

    // redisClient is assumed to be a connected Redis client; logger is the pino instance from above
    async function cacheGet(key, correlationId) {
      const start = Date.now();
      const value = await redisClient.get(key);
      const duration = Date.now() - start;
      // A missing key comes back as null; an empty string still counts as a hit
      const event = value !== null ? 'cache_hit' : 'cache_miss';
      logger.info({ event, key, duration_ms: duration, correlationId });
      return value;
    }

- Optional: lightweight tracing using correlation_id
- If you want tracing across services, propagate the X-Correlation-Id header and include it in logs:

    logger.info({ event: 'service_call', service: 'auth', correlationId: req.correlationId, duration_ms: elapsed });
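Passing the correlation id through every function call gets tedious, so one option is to keep it in request-scoped storage. The sketch below uses Node's built-in AsyncLocalStorage for that; the helper names (runWithCorrelation, currentCorrelationId, outboundHeaders) are hypothetical and only illustrate the pattern.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');
const crypto = require('node:crypto');

// Holds per-request context that survives across awaits and callbacks.
const requestContext = new AsyncLocalStorage();

// Run `fn` with a correlation id taken from the incoming headers,
// or a freshly generated one if the caller did not send any.
function runWithCorrelation(incomingHeaders, fn) {
  const correlationId =
    incomingHeaders['x-correlation-id'] || crypto.randomBytes(8).toString('hex');
  return requestContext.run({ correlationId }, fn);
}

// Look up the id of the request currently being handled (undefined outside one).
function currentCorrelationId() {
  const store = requestContext.getStore();
  return store ? store.correlationId : undefined;
}

// Build headers for a call to another service so the id travels with it.
function outboundHeaders(extra = {}) {
  const id = currentCorrelationId();
  return id ? { ...extra, 'X-Correlation-Id': id } : { ...extra };
}
```

In an Express-style middleware this could be wired up as runWithCorrelation(req.headers, next), after which any logger call or outbound HTTP request made while handling that request can read the id without it being passed as an argument.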
8) Gradual rollout plan
- Start small and non-disruptive
- Add correlation_id propagation and structured logging to one service first
- Build a local dashboard for that service and confirm logs reach the central store
- Expand instrumentation incrementally
- Instrument one critical path end-to-end
- Introduce basic metrics (latency, error rate) and a simple alert
- Reuse existing tools
- If you have a SIEM or log aggregator, leverage it before introducing new tooling
- Decommission gradually
- As you gain confidence, prune verbose logs and avoid excessive logging in hot paths
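One way to prune verbose logs in hot paths without losing whole requests is to sample by correlation id, so that a request is either fully logged or fully skipped. This is a sketch under assumptions: shouldSample and shouldLog are hypothetical helpers, the event shape follows this tutorial's examples, and hashing with SHA-256 is just one convenient way to get a stable bucket.

```javascript
const crypto = require('node:crypto');

// Deterministically decide whether a request is in the sampled set.
// The same correlation id always yields the same answer, so all log lines
// for one request are kept or dropped together.
function shouldSample(correlationId, rate) {
  if (rate >= 1) return true;
  if (rate <= 0) return false;
  const hash = crypto.createHash('sha256').update(String(correlationId)).digest();
  const bucket = hash.readUInt32BE(0) / 0x100000000; // value in [0, 1)
  return bucket < rate;
}

// Always keep errors; sample everything else at the given rate.
function shouldLog(event, rate) {
  if (event.level === 'error' || (event.statusCode || 0) >= 500) return true;
  return shouldSample(event.correlationId, rate);
}
```

Keeping every error while sampling successful requests usually preserves the data you need during an incident at a fraction of the volume.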
9) Optional: lightweight SRE practices for small teams
- Define a small service level objective (SLO)
- Example: 99th percentile latency under 800 ms for the checkout path
- Maintain error budgets
- If error budget is exhausted, limit changes other than critical fixes
- Keep runbooks living documents
- Store them in version control; rehearse on a quarterly basis
- Automate only what is necessary
- Automate deployment checks that verify instrumentation is alive after a release
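The error budget idea reduces to simple arithmetic, sketched below. Here a "bad" request is whatever violates your SLO (for example a 5xx response, or a checkout request slower than the latency target); the errorBudget function and its field names are hypothetical.

```javascript
// Compute how much of the error budget has been used in a period.
// sloTarget is the fraction of requests that must be good, e.g. 0.999.
function errorBudget(sloTarget, totalRequests, badRequests) {
  // The small epsilon guards against floating-point results such as 99.99999999999997.
  const allowedBad = Math.floor((1 - sloTarget) * totalRequests + 1e-9);
  const remaining = allowedBad - badRequests;
  return {
    allowedBad,
    remaining,
    exhausted: remaining <= 0,
    consumedFraction: allowedBad === 0 ? 1 : badRequests / allowedBad,
  };
}
```

With a 99.9% target and 1,000,000 requests in the period, the budget is 1,000 bad requests; after 250 of them a quarter of the budget is gone and 750 remain. Once exhausted is true, the policy above applies: limit changes to critical fixes until the budget recovers.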
Illustrative example: end-to-end workflow
- A user visits /checkout
- The API service handles the request, logs request_start
- It queries PostgreSQL for inventory and creates an order
- It caches some results in Redis
- The response includes a correlation_id for downstream tracing
- If an error occurs, the error is logged with the correlation_id, enabling rapid triage
If you’d like, I can tailor this plan to your exact tech stack (language, framework, hosting, and logging stack) and provide a concrete rollout schedule with checklist items and sample dashboards. Would you share details about your current stack, logging system, and any existing monitoring you’re already using?
Rizwan Saleem | https://rizwansaleem.co