It's 2 AM. You get a call: "The website is broken."
You SSH into your server, run `top` to check CPU, maybe `df -h` for disk space. Everything looks... fine? You restart the application. It works again. But you're left wondering: what actually went wrong?
This happens because we don't have visibility into what's happening inside our systems. That's what observability solves.
The Three Pillars
Observability data comes in three forms:
- Metrics: Numbers that change over time
- Logs: Detailed records of specific events
- Traces: Maps of requests flowing through distributed systems
Each answers different questions. Let's break them down.
Metrics: Is Everything OK?
Metrics are numbers that change over time. Think of them as your app's vital signs—temperature and pulse, measured continuously.
```python
# Server health
cpu_usage_percent = 45
memory_usage_percent = 67
disk_usage_percent = 23

# Application health
requests_per_minute = 120
response_time_ms = 250
failed_requests_percent = 0.8
active_users = 43
```
Metrics tell you something is wrong before users complain. If `failed_requests_percent` jumps from 0.8% to 15%, you know there's a problem.
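Here's a minimal sketch of how these numbers get recorded in practice, using Python's prometheus_client library (my pick for illustration; the metric names are made up, and any metrics client follows the same pattern):

```python
# Minimal sketch, assuming `pip install prometheus-client`.
# All metric names here are illustrative.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests")
FAILURES = Counter("http_requests_failed_total", "Failed HTTP requests")
LATENCY = Histogram("http_request_duration_seconds", "Request latency")
ACTIVE_USERS = Gauge("active_users", "Currently active users")

start_http_server(8000)  # expose metrics at :8000/metrics for scraping

@LATENCY.time()  # record how long each call takes
def handle_request():
    REQUESTS.inc()
    # ... real handler logic; call FAILURES.inc() when something breaks
```

A metrics backend scrapes that endpoint on a schedule, which is what turns raw counters into the rates and percentages shown above.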
When to Use Metrics
- Dashboards: Visualize trends over time
- Alerting: Get notified when thresholds are breached
- Capacity planning: Predict when you'll run out of resources
- SLO tracking: Monitor service level objectives
Example
Your response time metric shows requests normally taking 200ms are now taking 2000ms. You check and find the database connection pool is exhausted. Fixed before users notice.
Metrics answer: "Is my system healthy?"
Logs: What Exactly Happened?
Metrics tell you that something is wrong. Logs tell you what.
Logs are detailed records of specific events. They're a diary of everything your app does.
Instead of just knowing "more requests are failing," logs show:
```
2024-08-08T14:30:15Z ERROR [AuthService] Failed login attempt for user@email.com: invalid password
2024-08-08T14:30:16Z ERROR [AuthService] Failed login attempt for user@email.com: invalid password
2024-08-08T14:30:17Z ERROR [AuthService] Failed login attempt for user@email.com: account locked after 3 failed attempts
2024-08-08T14:30:45Z INFO [AuthService] Password reset requested for user@email.com
```
Now the "failed requests spike" makes sense—a user forgot their password. Not a bug, just expected behavior.
Structure Your Logs
Free-form text logs become useless at scale. Use structured logging (JSON):
```json
{
  "timestamp": "2024-08-08T14:30:15Z",
  "level": "ERROR",
  "service": "AuthService",
  "message": "Failed login attempt",
  "user_id": "12345",
  "email": "user@email.com",
  "reason": "invalid_password",
  "attempt_count": 1
}
```
Structured logs let you query:
- "Show me all errors from AuthService in the last hour"
- "How many failed login attempts did user 12345 have today?"
Logs answer: "What exactly happened?"
Traces: Where Did It Break?
Your single-server app grows into microservices:
- User Service: Authentication and profiles
- Product Service: Product catalog
- Order Service: Processes purchases
- Payment Service: Handles transactions
A user reports: "I can't complete my purchase. The page just hangs."
You check metrics—all services look healthy. You check logs in each service... but which service did the request even touch? How do you follow a single user's journey across multiple services?
Traces Connect the Dots
A trace shows the path of a single request through your distributed system.
When a user clicks "Buy Now":
1. Order Service receives the request and creates a trace ID (e.g., "abc123")
2. It records: "I'm processing order abc123, started at 10:30:15"
3. When it calls User Service, it passes along that trace ID
4. User Service records: "I'm verifying user for trace abc123"
5. When User Service calls Payment Service, it forwards the same trace ID
6. Payment Service records: "I'm charging card for trace abc123"
Each service leaves breadcrumbs connected by the same trace ID. The result:
```
Purchase Request [1,300ms total]
├── Order Service: Process Order [50ms] ✓
├── User Service: Verify User [100ms] ✓
├── Product Service: Check Inventory [150ms] ✓
├── Payment Service: Charge Card [900ms] ⚠️
│   ├── Validate Card [100ms] ✓
│   ├── External Payment Gateway [750ms] ⚠️
│   └── Update Transaction [50ms] ✓
└── Order Service: Finalize Order [100ms] ✓
```
The problem is obvious: 750ms of the request is spent waiting on the external payment gateway.
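Here's a minimal sketch of producing spans like these with the OpenTelemetry SDK for Python (a common choice, not a requirement; the service and span names are illustrative):

```python
# Minimal sketch, assuming `pip install opentelemetry-sdk`.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout; a real setup exports to a backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "abc123")  # illustrative attribute
    with tracer.start_as_current_span("charge_card"):
        pass  # the call to Payment Service would happen here
```

Within one process, child spans pick up the trace ID automatically; to carry it across service boundaries you add a propagator (W3C Trace Context is the default) or use OpenTelemetry's auto-instrumentation for your HTTP framework.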
When to Use Traces
- Debugging latency: Find which service is slow
- Understanding dependencies: See how services interact
- Root cause analysis: Follow a failing request across the stack
- Performance optimization: Identify bottlenecks
Traces answer: "Where in my distributed system did things go wrong?"
Combining All Three: A Real Scenario
Your e-commerce site is struggling during a Black Friday sale.
Step 1: Metrics
Response times are spiking. Error rate is up.
Step 2: Logs
Timeout errors in the Payment Service.
Step 3: Traces
Payment requests take 10+ seconds, but only for orders over $500.
Root cause: The fraud detection system (called by Payment Service) has a bug that makes it extremely slow for high-value transactions.
Without all three, you might have wasted hours checking database connections, server resources, or network issues.
Quick Reference: When to Use What
| Question | Use |
|---|---|
| Is my system healthy? | Metrics |
| What exactly happened? | Logs |
| Where did it break (distributed)? | Traces |
| Should I alert on-call? | Metrics (thresholds) |
| Why did this specific request fail? | Logs + Traces |
| Which service is the bottleneck? | Traces |
Common Mistakes
Metric overload
Don't track everything. Start with what matters to users: latency, error rate, throughput.
Unstructured logs
console.log("error happened") is useless at scale. Add context, use JSON.
Tracing everything
Sample your traces. Capturing 100% of them hurts performance and blows up your storage bill.
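With OpenTelemetry, for instance, head-based sampling is a one-line change (the 10% ratio is just an example; tune it to your traffic):

```python
# Minimal sketch: keep ~10% of traces. ParentBased honors the caller's
# sampling decision so a trace isn't half-recorded across services.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
```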
Using them in isolation
Each pillar is useful alone. Together, they're powerful. Correlate metric spikes with log errors and trace timelines.
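The easiest way to enable that correlation is to stamp the active trace ID onto every log line. Here's a sketch combining the earlier structlog and OpenTelemetry examples (both of which were my assumptions, not requirements):

```python
# Minimal sketch: attach the current trace ID to a structured log line
# so your backend can join logs and traces for the same request.
from opentelemetry import trace
import structlog

log = structlog.get_logger()

def log_error(event: str, **fields):
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        fields["trace_id"] = format(ctx.trace_id, "032x")  # 32-char hex ID
    log.error(event, **fields)

log_error("Payment gateway timeout", service="PaymentService")
```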
Getting Started
If you're not sure where to begin:
- Start with metrics for the RED method: Rate (requests/sec), Errors (error rate), Duration (latency)
- Add structured logging to your services
- Implement tracing when you have 2+ services communicating
Sign up for a 14-day free OpenObserve Cloud trial and integrate your metrics, logs, and traces into one powerful platform to boost your operational efficiency and enable smarter, faster decision-making.
What's your observability setup? Are you using all three pillars, or still figuring out where to start? Let me know in the comments 👇