Dylan Dumont
Structured Logging: Why log!() Is Killing Your On-Call Experience

"String concatenation in your logs is not just noisy—it is the single biggest contributor to slow mean time to resolve (MTTR) during production incidents."

What We're Building

We are moving from a legacy logging implementation where errors are printed as opaque strings to a robust, structured logging pipeline. In this design, every log entry is a typed object containing a severity level, a standardized message, dynamic context (request ID, user, error type), and a timestamp. The goal is to eliminate "magic strings" that vanish into a black box, so that when an on-call engineer opens the logs, they immediately see the exact request payload and state that caused the failure.

Step 1 — Define a Log Schema

Before emitting data, you must enforce a contract on what a log entry looks like. This prevents developers from accidentally logging sensitive data or omitting crucial fields like trace_id. In Rust, we use a struct with serde serialization to ensure the JSON output matches a schema.

use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
pub struct LogEntry {
    pub level: String,
    pub message: String,
    pub trace_id: String,
    pub timestamp: String, // RFC 3339, e.g. "2024-01-01T00:00:00Z"
    #[serde(flatten)]
    pub context: Context,
}

Type safety guarantees at compile time that required fields exist and prevents typos in field names that would break log aggregation pipelines.
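To make the target output shape concrete, here is a minimal std-only sketch that hand-rolls the JSON serde would emit, including the effect of `#[serde(flatten)]` lifting context fields to the top level. In the real pipeline you would call `serde_json::to_string(&entry)` instead; the field values below are illustrative.

```rust
// Std-only sketch of the JSON shape serde would produce for LogEntry.
// In production, serde_json handles this; this just shows the contract.
pub struct Context {
    pub tenant_id: String,
    pub client_ip: String,
}

pub struct LogEntry {
    pub level: String,
    pub message: String,
    pub trace_id: String,
    pub context: Context,
}

impl LogEntry {
    // Emulates #[serde(flatten)]: context fields appear at the top level,
    // not nested under a "context" key.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"level\":\"{}\",\"message\":\"{}\",\"trace_id\":\"{}\",\"tenant_id\":\"{}\",\"client_ip\":\"{}\"}}",
            self.level, self.message, self.trace_id,
            self.context.tenant_id, self.context.client_ip
        )
    }
}
```

Every entry now lands in the aggregator with the same five keys, so a query like `tenant_id = "acme"` matches reliably.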

Step 2 — Contextualize Failures

A log entry without context is useless during an incident. You must capture dynamic metadata such as tenant_id, user_agent, and client_ip at the time of the request, not by searching a database later.

use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
pub struct Context {
    pub tenant_id: String,
    pub user_id: Option<String>,
    pub client_ip: String,
}

fn handle_request(ctx: Context, trace_id: String, result: Result<(), Error>) {
    if let Err(e) = result {
        log(&LogEntry {
            level: "ERROR".to_string(),
            message: e.to_string(),
            trace_id,
            context: ctx,
        });
    }
}

This isolates issues to specific tenants or users instantly without needing to query external metadata stores during troubleshooting.
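A minimal sketch of capturing that context once, at the edge of the request. The `Request` type and the `x-tenant-id` / `x-user-id` header names here are illustrative assumptions; a real framework (axum, actix-web) exposes headers and the peer address through its own API.

```rust
use std::collections::HashMap;

// Hypothetical request type standing in for a framework's request object.
pub struct Request {
    pub headers: HashMap<String, String>,
    pub peer_ip: String,
}

pub struct Context {
    pub tenant_id: String,
    pub user_id: Option<String>,
    pub client_ip: String,
}

// Capture context once, when the request arrives, so every later
// log line carries it without a database lookup.
pub fn context_from_request(req: &Request) -> Context {
    Context {
        tenant_id: req.headers.get("x-tenant-id").cloned().unwrap_or_default(),
        user_id: req.headers.get("x-user-id").cloned(),
        client_ip: req.peer_ip.clone(),
    }
}
```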

Step 3 — Correlate Requests

In a microservices architecture, a single logical request flows through five or six services. Without correlation IDs, you cannot correlate the error log in Service A to the upstream API call in Service B. You must inject a trace ID into every outbound HTTP header.

User -> API Gateway -> [trace_id: a1b2c3] -> Service A -> [trace_id: a1b2c3] -> Service B
use async_trait::async_trait;

#[async_trait]
pub trait ContextCarrier {
    async fn inject(&mut self, trace_id: &str);
    async fn extract(&self) -> String;
}

This enables end-to-end visibility across distributed systems, allowing engineers to stitch logs from different containers into a single timeline.
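For illustration, here is a synchronous, std-only sketch of a carrier backed by an HTTP-header-like map. The header name `x-trace-id` is an assumption for this example; the W3C Trace Context standard uses `traceparent` instead.

```rust
use std::collections::HashMap;

// Std-only sketch: outbound headers as a plain map.
pub struct HeaderCarrier {
    pub headers: HashMap<String, String>,
}

impl HeaderCarrier {
    // Inject before the outbound call in Service A...
    pub fn inject(&mut self, trace_id: &str) {
        self.headers
            .insert("x-trace-id".to_string(), trace_id.to_string());
    }

    // ...and extract on the inbound side in Service B.
    pub fn extract(&self) -> Option<String> {
        self.headers.get("x-trace-id").cloned()
    }
}
```

The same `a1b2c3` identifier survives the hop, so a search for it returns entries from every service the request touched.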

Step 4 — Flush Asynchronously

Log flushing can be expensive. If a logging call blocks the request handler, it introduces latency under load. You must configure the logger to use an asynchronous queue that batches entries before writing to disk or UDP.

use tracing::Subscriber;
use tracing_subscriber::layer::Layer;

pub struct AsyncLogLayer {
    pub batch_size: usize,
}

impl<S: Subscriber> Layer<S> for AsyncLogLayer {
    // on_event would push entries onto an async queue; a background task
    // batches them (up to batch_size) before writing to disk or UDP,
    // so the caller never blocks on I/O.
}

Preventing log flushing from blocking request handlers ensures that observability overhead does not impact service availability or SLAs.
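A minimal std-only sketch of the non-blocking pattern, using a channel and a background thread in place of a real async runtime. The `BatchLogger` name and batch-collection return value are illustrative, not part of any library API.

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative sketch: callers hand log lines to a channel and return
// immediately; a background thread drains and batches them.
pub struct BatchLogger {
    tx: mpsc::Sender<String>,
}

impl BatchLogger {
    pub fn new(batch_size: usize) -> (Self, thread::JoinHandle<Vec<Vec<String>>>) {
        let (tx, rx) = mpsc::channel::<String>();
        let handle = thread::spawn(move || {
            let mut batches = Vec::new();
            let mut batch = Vec::new();
            for line in rx {
                batch.push(line);
                if batch.len() >= batch_size {
                    // In production: write the whole batch to disk/UDP here.
                    batches.push(std::mem::take(&mut batch));
                }
            }
            if !batch.is_empty() {
                batches.push(batch); // flush the remainder on shutdown
            }
            batches
        });
        (BatchLogger { tx }, handle)
    }

    // Non-blocking from the caller's perspective: just a channel send.
    pub fn log(&self, line: String) {
        let _ = self.tx.send(line);
    }
}
```

Dropping the logger closes the channel, which ends the background loop and flushes any partial batch.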

Key Takeaways

  • Schema Enforcement: Using typed structs guarantees consistent field names across the codebase, preventing data loss during aggregation.
  • Context Propagation: Attaching dynamic metadata like tenant_id allows immediate filtering of logs for specific customer accounts without database queries.
  • Async Flushing: Batching and asynchronous writes prevent logging from becoming a bottleneck that degrades performance under high load.
  • Trace ID Consistency: Injecting a global identifier ensures that logs from different microservices can be linked in a single investigation timeline.

What's Next?

  • Implement distributed tracing standards like OpenTelemetry to automatically propagate trace IDs across all services.
  • Configure log aggregation tools like Loki or Elasticsearch to index structured fields for efficient SQL-like queries.
  • Set up alerting rules that detect specific log patterns or elevated rates of ERROR-level entries.

Further Reading

Part of the Architecture Patterns series.
