Simran Kumari

Posted on • Originally published at openobserve.ai

Logs, Metrics, and Traces: What They Are and When to Use Each

It's 2 AM. You get a call: "The website is broken."

You SSH into your server, run top to check CPU, maybe df -h for disk space. Everything looks... fine? You restart the application. It works again. But you're left wondering: what actually went wrong?

This happens because we don't have visibility into what's happening inside our systems. That's what observability solves.

The Three Pillars

Observability data comes in three forms:

  • Metrics: Numbers that change over time
  • Logs: Detailed records of specific events
  • Traces: Maps of requests flowing through distributed systems

Logs, metrics and traces: The pillars of Observability

Each answers different questions. Let's break them down.


Metrics: Is Everything OK?

Metrics are numbers that change over time. Think of them as your app's vital signs—temperature and pulse, measured continuously.

# Server health
cpu_usage_percent = 45
memory_usage_percent = 67
disk_usage_percent = 23

# Application health  
requests_per_minute = 120
response_time_ms = 250
failed_requests_percent = 0.8
active_users = 43

Metrics tell you something is wrong before users complain. If failed_requests_percent jumps from 0.8% to 15%, you know there's a problem.

When to Use Metrics

  • Dashboards: Visualize trends over time
  • Alerting: Get notified when thresholds are breached
  • Capacity planning: Predict when you'll run out of resources
  • SLO tracking: Monitor service level objectives
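The alerting idea above can be sketched as a simple threshold check. Everything here is illustrative: the metric names match the earlier snippet, but the threshold values and the `check_thresholds` helper are made up for the example, not part of any monitoring tool.

```python
# Minimal threshold-alert sketch: compare current metric values
# against static limits and report every breach.

def check_thresholds(metrics: dict, thresholds: dict) -> list:
    """Return (name, value, limit) for every metric above its limit."""
    breaches = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            breaches.append((name, value, limit))
    return breaches

current = {
    "failed_requests_percent": 15.0,
    "response_time_ms": 250,
    "cpu_usage_percent": 45,
}
limits = {
    "failed_requests_percent": 5.0,
    "response_time_ms": 1000,
}

for name, value, limit in check_thresholds(current, limits):
    print(f"ALERT: {name}={value} exceeds threshold {limit}")
```

Real systems evaluate rules like this continuously on a metrics backend, but the logic is the same: numbers crossing a line trigger a page.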

Example

Your response time metric shows requests normally taking 200ms are now taking 2000ms. You check and find the database connection pool is exhausted. Fixed before users notice.

Sample dashboard showing metrics over time

Metrics answer: "Is my system healthy?"


Logs: What Exactly Happened?

Metrics tell you that something is wrong. Logs tell you what.

Logs are detailed records of specific events. They're a diary of everything your app does.

Instead of just knowing "more requests are failing," logs show:

2024-08-08T14:30:15Z ERROR [AuthService] Failed login attempt for user@email.com: invalid password
2024-08-08T14:30:16Z ERROR [AuthService] Failed login attempt for user@email.com: invalid password  
2024-08-08T14:30:17Z ERROR [AuthService] Failed login attempt for user@email.com: account locked after 3 failed attempts
2024-08-08T14:30:45Z INFO [AuthService] Password reset requested for user@email.com

Now the "failed requests spike" makes sense—a user forgot their password. Not a bug, just expected behavior.

Structure Your Logs

Free-form text logs become useless at scale. Use structured logging (JSON):

{
  "timestamp": "2024-08-08T14:30:15Z",
  "level": "ERROR",
  "service": "AuthService", 
  "message": "Failed login attempt",
  "user_id": "12345",
  "email": "user@email.com",
  "reason": "invalid_password",
  "attempt_count": 1
}

Structured logs let you query:

  • "Show me all errors from AuthService in the last hour"
  • "How many failed login attempts did user 12345 have today?"
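As a sketch of how you might produce JSON lines like the one above, Python's standard logging module supports a custom formatter. The field names mirror the example; the `JsonFormatter` class itself is my own illustration, not a standard library feature.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Merge any structured fields attached via `extra={"fields": ...}`
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("AuthService")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Failed login attempt",
             extra={"fields": {"user_id": "12345", "reason": "invalid_password"}})
```

Once every line is JSON, your log backend can index the fields and the queries above become simple filters.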

Filtering logs by service and level

Logs answer: "What exactly happened?"


Traces: Where Did It Break?

Your single-server app grows into microservices:

  • User Service: Authentication and profiles
  • Product Service: Product catalog
  • Order Service: Processes purchases
  • Payment Service: Handles transactions

A user reports: "I can't complete my purchase. The page just hangs."

You check metrics—all services look healthy. You check logs in each service... but which service did the request even touch? How do you follow a single user's journey across multiple services?

Traces Connect the Dots

A trace shows the path of a single request through your distributed system.

When a user clicks "Buy Now":

  1. Order Service receives the request and creates a trace ID (e.g., "abc123")
  2. It records: "I'm processing order abc123, started at 10:30:15"
  3. When it calls User Service, it passes along that trace ID
  4. User Service records: "I'm verifying user for trace abc123"
  5. When User Service calls Payment Service, same trace ID
  6. Payment Service records: "I'm charging card for trace abc123"

Each service leaves breadcrumbs connected by the same trace ID. The result:

Purchase Request [1,200ms total]
├── Order Service: Process Order [50ms]
├── User Service: Verify User [100ms] ✓
├── Product Service: Check Inventory [150ms] ✓  
├── Payment Service: Charge Card [900ms] ⚠️
│   ├── Validate Card [100ms] ✓
│   ├── External Payment Gateway [750ms] ⚠️ 
│   └── Update Transaction [50ms] ✓
└── Order Service: Finalize Order [100ms] ✓

The problem is obvious: the external payment gateway is taking 750ms.
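The breadcrumb-passing described above can be sketched in a few lines. The service functions and the span recorder here are toy stand-ins, not a real tracing SDK; in practice you'd use something like OpenTelemetry, where the trace ID travels between services in a request header such as `traceparent`.

```python
import time
import uuid

spans = []  # a real system exports spans to a tracing backend

def record_span(trace_id, service, operation, start, end):
    spans.append({
        "trace_id": trace_id,
        "service": service,
        "operation": operation,
        "duration_ms": round((end - start) * 1000),
    })

def payment_service(trace_id):
    # Downstream service: receives the trace ID, records its own span.
    start = time.time()
    # ... charge the card ...
    record_span(trace_id, "PaymentService", "charge_card", start, time.time())

def order_service():
    # Entry-point service: creates the trace ID ...
    trace_id = uuid.uuid4().hex[:8]
    start = time.time()
    # ... and passes it along with every downstream call.
    payment_service(trace_id)
    record_span(trace_id, "OrderService", "process_order", start, time.time())
    return trace_id

tid = order_service()
# Every span shares the same trace ID, so a backend can stitch
# them into the tree shown above.
assert all(s["trace_id"] == tid for s in spans)
```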

Trace view showing request flow across services

When to Use Traces

  • Debugging latency: Find which service is slow
  • Understanding dependencies: See how services interact
  • Root cause analysis: Follow a failing request across the stack
  • Performance optimization: Identify bottlenecks

Traces answer: "Where in my distributed system did things go wrong?"


Combining All Three: A Real Scenario

Your e-commerce site is struggling during a Black Friday sale.

Step 1: Metrics

Response times are spiking. Error rate is up.

Step 2: Logs

Timeout errors in the Payment Service.

Step 3: Traces

Payment requests take 10+ seconds, but only for orders over $500.

Root cause: The fraud detection system (called by Payment Service) has a bug that makes it extremely slow for high-value transactions.

Without all three, you might have wasted hours checking database connections, server resources, or network issues.


Quick Reference: When to Use What

  • Is my system healthy? → Metrics
  • What exactly happened? → Logs
  • Where did it break (distributed)? → Traces
  • Should I alert on-call? → Metrics (thresholds)
  • Why did this specific request fail? → Logs + Traces
  • Which service is the bottleneck? → Traces

Common Mistakes

Metric overload

Don't track everything. Start with what matters to users: latency, error rate, throughput.

Unstructured logs

console.log("error happened") is useless at scale. Add context, use JSON.

Tracing everything

Sample your traces. 100% trace coverage kills performance and storage.
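Head-based sampling can be as simple as a coin flip per trace. This is a toy sketch; real tracing SDKs ship configurable samplers, but the core idea is exactly this:

```python
import random

def should_sample(rate: float = 0.1) -> bool:
    """Head-based sampling: keep roughly `rate` of all traces."""
    return random.random() < rate

# Keep ~10% of 10,000 simulated requests.
random.seed(42)  # deterministic for the example
kept = sum(should_sample(0.1) for _ in range(10_000))
print(f"kept {kept} of 10000 traces")
```

Ten percent of traces is usually plenty to spot a slow dependency, at a tenth of the storage cost.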

Using them in isolation

Each pillar is useful alone. Together, they're powerful. Correlate metrics spikes with log errors with trace timelines.


Getting Started

If you're not sure where to begin:

  1. Start with metrics for the RED method: Rate (requests/sec), Errors (error rate), Duration (latency)
  2. Add structured logging to your services
  3. Implement tracing when you have 2+ services communicating
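A toy sketch of what RED counters look like in-process; the `observe_request` helper is invented for illustration, and a real service would export these numbers to a metrics backend rather than keep them in a dict.

```python
from collections import defaultdict

red = defaultdict(float)  # Rate and Errors as counters
latencies = []            # Duration samples

def observe_request(duration_ms: float, failed: bool):
    red["requests_total"] += 1
    if failed:
        red["errors_total"] += 1
    latencies.append(duration_ms)

# Simulate a few requests
observe_request(120, failed=False)
observe_request(95, failed=False)
observe_request(2200, failed=True)

error_rate = red["errors_total"] / red["requests_total"]
print(f"requests={red['requests_total']:.0f} "
      f"error_rate={error_rate:.1%} "
      f"max_latency={max(latencies)}ms")
```

Rate, Errors, and Duration cover the three questions users actually care about: is it up, is it failing, is it slow.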

Sign up for a 14-day free OpenObserve Cloud trial and integrate your metrics, logs, and traces into one powerful platform to boost your operational efficiency and enable smarter, faster decision-making.


What's your observability setup? Are you using all three pillars, or still figuring out where to start? Let me know in the comments 👇
