Authored by Conall Heffernan
Centralized logging is a good start to improving your log management: collecting, storing, and analyzing logs from multiple sources in a single repository makes it easier for dev, support, product, and SRE teams to manage and access logs, and makes security and compliance requirements easier to meet.
Once you've centralized your logs, the practices below will take you further. High-quality logs are the foundation of effective observability. Consistent, structured, and well-tagged log data allows teams to quickly identify performance issues, troubleshoot errors, and optimize cost and performance.
If AI is where intelligence meets data, then data quality is key in an AI world.
As AI automates more and more, clean, high-quality logs open the door to further automation and efficiency, enabling additional benefits and new AI use cases.
This guide covers recommended best practices for log structure and context enrichment, correlation, agent configuration, team ownership, and log strategy.
1. Log Structure and Context
Tags, log metadata, and message attributes are all key–value pairs (KVPs), but they serve different purposes and live at different levels of your event stream:
- Tags – Properties that apply to an entire stream of events (a dataset)
- Log metadata – Properties added to individual log records, typically by the logging agent or its plugins
- Message attributes – Properties embedded directly in the log message itself
Tags: Properties of the Dataset
Tags apply to all entries in a stream of events and are not visible as part of the log event itself. They are ideal for separating environments at query time (e.g. avoid mixing staging and prod).
Examples of good tags:
environment=production
account_id=12345678
region=us-east-1
Set tags via agent configuration so they are applied automatically to all data processed by that agent. Configuration management tools such as Terraform or CloudFormation can set these tags consistently across your infrastructure.
Log Metadata: Properties of the Source
Log metadata is a set of key–value pairs attached to an individual log record, typically added by the agent (often via plugins) rather than by the application itself. It usually describes:
- The host or node, e.g. host_name=web-01,os=linux
- The pod or container, e.g. pod_name=api-6c8d3f5c2f-wz2vt,namespace=payments
- The service name and version, e.g. service=checkout-api,version=2.3.1
A key point: a single agent can process data from multiple hosts, pods, services, or versions, and the metadata will reflect those differences on a per-record basis.
Message Attributes: Properties Inside the Log Message
Message attributes are key–value pairs present inside the log message body itself, authored by application developers and specific to a single log entry. They're ideal for capturing fine-grained, per-request context:
{"level":"info","message":"request processed","duration_ms":123}
Common examples:
duration_ms=123
request_id=abc-123
retry_count=2
Two formats are typically supported out of the box:
- The entire message is formatted as JSON
- key=value pairs within the log message (values may be quoted; : can be used instead of =)
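To make this concrete, here is a minimal sketch of emitting that kind of structured entry from application code, using Python's standard logging module. The JsonFormatter class and the attrs field are illustrative choices, not a prescribed API; most logging frameworks have an equivalent JSON layout.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render every record as a single JSON line so attributes stay machine-readable."""
    def format(self, record):
        payload = {"level": record.levelname.lower(), "message": record.getMessage()}
        # Merge per-request attributes passed via logging's `extra` mechanism.
        payload.update(getattr(record, "attrs", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# {"level": "info", "message": "request processed", "duration_ms": 123, "request_id": "abc-123"}
logger.info("request processed",
            extra={"attrs": {"duration_ms": 123, "request_id": "abc-123"}})
```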
Note: Most modern logging platforms index these attributes automatically; manually managing and configuring indexes is time-consuming, cumbersome work you shouldn't need to do.
Exception and Stack Trace Handling
- Use agent-side multiline support (e.g., the Fluent Bit multiline filter) to capture stack traces as single log events
- Report exception name and stack trace as structured attributes:
exception.type
exception.stacktrace
This makes it easy to query and alert on recurring or unexpected exceptions.
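As a sketch of what that can look like in application code, continuing the attrs pattern from the earlier Python example (charge() and the order payload are hypothetical):

```python
import logging
import traceback

logger = logging.getLogger("checkout-api")

def charge(order):
    # Hypothetical payment call used only to trigger an exception.
    raise TimeoutError("payment gateway timed out")

try:
    charge({"id": "ord-42"})
except Exception as exc:
    # Attach the exception name and the full stack trace as structured attributes.
    logger.error("payment failed", extra={"attrs": {
        "exception.type": type(exc).__name__,
        "exception.stacktrace": traceback.format_exc(),
    }})
```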
2. Correlation
Trace and Correlation IDs
Add fields like trace_id, span_id, and request_id to your logs so you can tie them back to a single user request or workflow across multiple services. In a distributed system, a single call can pass through frontends, APIs, queues, and background workers — without a shared ID, the logs from each hop look like isolated events.
With a common ID, you can filter on that value and reconstruct the full timeline of "what happened where and when," instead of guessing based on timestamps and hosts.
How to add them — it's usually a combination of code and tooling:
- A tracing library or standard (such as OpenTelemetry) generates and propagates trace and span context across service boundaries. Most logging frameworks can be configured to automatically include those IDs on every log entry.
- At the same time, use an application-level request_id or correlation ID (often taken from or added to an HTTP header at the edge) and pass it through your services.
A robust setup does both: use tracing context (trace_id, span_id) and ensure they are consistently present in logs so any logging or observability system can correlate events end-to-end.
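For illustration, here is a small sketch of pulling the active OpenTelemetry trace context into log attributes using the opentelemetry-api Python package; the helper name and the attrs pattern are assumptions carried over from the earlier examples, and many logging integrations can inject these IDs automatically instead.

```python
import logging
from opentelemetry import trace

logger = logging.getLogger("checkout-api")

def trace_attrs() -> dict:
    """Return trace_id/span_id for the active span as hex strings, or {} outside a trace."""
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return {}
    return {"trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x")}

logger.info("request processed",
            extra={"attrs": {**trace_attrs(), "request_id": "abc-123"}})
```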
3. Agent Configuration & Processing
The OpenTelemetry Collector and similar agents like Fluent Bit, Logstash, and Vector can enrich, sanitize, and optimize log data before it ever reaches storage.
Recommended Configurations
Redact PII before logs leave your infrastructure. Mask or drop fields like emails, full names, IPs, IDs, and tokens at the agent or collector level — so even if logs are leaked or shared, sensitive data isn't exposed.
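In practice this is usually configured in the agent or collector itself, but the idea is simple enough to sketch in a few lines of Python; the patterns below are illustrative and far from exhaustive:

```python
import re

# Illustrative patterns only; extend these to cover the identifiers your systems emit.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact(message: str) -> str:
    """Mask obvious PII before a log line leaves your infrastructure."""
    message = EMAIL.sub("[REDACTED_EMAIL]", message)
    message = IPV4.sub("[REDACTED_IP]", message)
    return message

print(redact("login failed for alice@example.com from 203.0.113.7"))
# login failed for [REDACTED_EMAIL] from [REDACTED_IP]
```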
Configure multiline stacktrace handling so full exceptions are captured as a single log event instead of being split into many noisy lines. This typically means using a multiline rule that continues a record while lines match patterns like ^\s+at.
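Conceptually, the multiline rule does something like the following; the Python below only illustrates the grouping logic, since in a real setup this lives in the agent configuration rather than in application code:

```python
import re

# Lines that continue a stack trace rather than start a new event (illustrative patterns).
CONTINUATION = re.compile(r"^\s+at |^Caused by: |^\t")

def group_multiline(lines):
    """Append continuation lines to the record that started them."""
    events = []
    for line in lines:
        if events and CONTINUATION.match(line):
            events[-1] += "\n" + line
        else:
            events.append(line)
    return events

raw = ["ERROR payment failed",
       "    at com.shop.Checkout.charge(Checkout.java:42)",
       "    at com.shop.Api.handle(Api.java:17)",
       "INFO request processed"]
print(len(group_multiline(raw)))  # 2 events instead of 4 lines
```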
Normalize log levels before shipping. If you don't, breakdowns by log level in dashboards will look fragmented — instead of a clean INFO / WARN / ERROR, you'll see multiple tiny buckets like info, Info, INFO, error, and ERR that all mean the same thing.
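A tiny sketch of that normalization, written here as a Python mapping for clarity; with most agents you would express the same remap as a processor or filter rule rather than code:

```python
# Map the variants you actually see in your logs onto one canonical set (assumed set here).
LEVEL_MAP = {
    "info": "INFO", "information": "INFO",
    "warn": "WARN", "warning": "WARN",
    "err": "ERROR", "error": "ERROR", "fatal": "ERROR",
}

def normalize_level(raw: str) -> str:
    """Fold case and spelling variants into a single level name."""
    key = raw.strip().lower()
    return LEVEL_MAP.get(key, key.upper())

print(normalize_level("Err"), normalize_level("warning"), normalize_level("INFO"))
# ERROR WARN INFO
```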
Use batch and memory limiter processors (for example with OTel):
| Processor | What it does |
|---|---|
| batch | Groups spans/logs/metrics into batches, improves throughput, reduces overhead |
| memory_limiter | Puts a hard cap on memory usage, drops data or throttles when usage exceeds thresholds |
Strike a balance: let agents fix inconsistencies in third-party logs, but rely on developers to structure first-party logs correctly.
4. Team Practices & Ownership
Why It Matters
Logging is not just a technical setup — it's a shared responsibility across teams. Establishing clear ownership early ensures that logs are consistent, searchable, and actionable throughout your organization's lifecycle. It also makes it clear who is accountable for volume control (for example, leaving DEBUG on in production).
Best Practices
Assign team ownership from day one. Each dataset or service should have a defined owning team responsible for log quality, metadata, and alerting setup. This avoids confusion later when troubleshooting or optimizing costs.
Tag logs by team. Include a team or owner tag in metadata or agent configuration. This enables your logging platform to group logs, usage metrics, and cost by responsible team automatically, which is particularly useful for understanding volume spikes. Set up usage alerts so a given team is notified if their volumes suddenly go off the charts.
Encourage collaboration through shared queries. Make it a habit for teams to share saved queries, dashboards, and monitors. Common examples:
- "Error spikes by environment"
- "Token usage per service"
- "Slowest response patterns over 24h"
Shared queries reduce duplication and foster best-practice discovery internally.
Use team-based datasets. Group data logically — by service ownership rather than by underlying infrastructure — so each team can monitor the performance, health, and behavior of their own services without noise from unrelated systems.
Make accountability visible. Use tags and naming conventions that make ownership clear:
team=payments
service=checkout-api
env=prod
Pro Tip: Building a strong observability culture that promotes best practices early creates long-term efficiency. Teams that own their data from the start rarely need a cleanup project later.
5. Log Types and Strategy
Define what types of logs your organization will collect and how they'll be categorized:
| Type | Examples | Notes |
|---|---|---|
| Application | Custom app logs | Owned by dev teams |
| Third-party services | Kafka, NGINX, Redis | Semi-structured; normalize via agents or auto-parser |
| Infrastructure | syslog, journald | Often managed by SREs |
| Cloud | AWS, GCP, Azure | Forwarding integration needed; can be high volume (CloudTrail, Load Balancer logs) |
| Security | CloudTrail, auditd | Coordinate with SecOps/SIEM |
| CI/CD | Pipeline events | Great for trend correlation |
Pro Tip: Review overlap between application and infrastructure logs to avoid duplication and unnecessary ingestion usage. If your app logs request_id, user_id, status, and latency, and NGINX/syslog already records status and latency, keep those fields in one layer and use request_id to correlate, instead of ingesting the same details twice.
Wrapping Up
Good logging is a discipline, not a one-time setup. The combination of structured data, consistent metadata, proper correlation IDs, well-configured agents, and clear team ownership is what separates logs that collect dust from logs that actively drive engineering decisions.
Start with structure, assign ownership early, and build the habit of sharing queries and dashboards across teams. Your future self — debugging a production incident at 2am — will thank you.
