It's 2 AM. You get a call: "The website is broken."
You SSH into your server, run `top` to check CPU, maybe `df -h` for disk space. Everything looks... fine? You restart the application. It works again. But you're left wondering: what actually went wrong?
This happens because we don't have visibility into what's happening inside our systems. That's what observability solves.
The Three Pillars
Observability data comes in three forms:
- Metrics: Numbers that change over time
- Logs: Detailed records of specific events
- Traces: Maps of requests flowing through distributed systems
Each answers different questions. Let's break them down.
Metrics: Is Everything OK?
Metrics are numbers that change over time. Think of them as your app's vital signs—temperature and pulse, measured continuously.
```python
# Server health
cpu_usage_percent = 45
memory_usage_percent = 67
disk_usage_percent = 23

# Application health
requests_per_minute = 120
response_time_ms = 250
failed_requests_percent = 0.8
active_users = 43
```
Metrics tell you something is wrong before users complain. If `failed_requests_percent` jumps from 0.8% to 15%, you know there's a problem.
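Here's a minimal sketch of how these numbers get recorded in practice, using Python's prometheus_client library (my pick for illustration; the metric names are made up, and any metrics client follows the same pattern):

```python
# Minimal sketch, assuming `pip install prometheus-client`.
# All metric names here are illustrative.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests")
FAILURES = Counter("http_requests_failed_total", "Failed HTTP requests")
LATENCY = Histogram("http_request_duration_seconds", "Request latency")
ACTIVE_USERS = Gauge("active_users", "Currently active users")

start_http_server(8000)  # expose metrics at :8000/metrics for scraping

@LATENCY.time()  # record how long each call takes
def handle_request():
    REQUESTS.inc()
    # ... real handler logic; call FAILURES.inc() when something breaks
```

A metrics backend scrapes that endpoint on a schedule, which is what turns raw counters into the rates and percentages shown above.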
When to Use Metrics
- Dashboards: Visualize trends over time
- Alerting: Get notified when thresholds are breached
- Capacity planning: Predict when you'll run out of resources
- SLO tracking: Monitor service level objectives
Example
Your response time metric shows requests normally taking 200ms are now taking 2000ms. You check and find the database connection pool is exhausted. Fixed before users notice.
Metrics answer: "Is my system healthy?"
Logs: What Exactly Happened?
Metrics tell you that something is wrong. Logs tell you what.
Logs are detailed records of specific events. They're a diary of everything your app does.
Instead of just knowing "more requests are failing," logs show:
```
2024-08-08T14:30:15Z ERROR [AuthService] Failed login attempt for user@email.com: invalid password
2024-08-08T14:30:16Z ERROR [AuthService] Failed login attempt for user@email.com: invalid password
2024-08-08T14:30:17Z ERROR [AuthService] Failed login attempt for user@email.com: account locked after 3 failed attempts
2024-08-08T14:30:45Z INFO [AuthService] Password reset requested for user@email.com
```
Now the "failed requests spike" makes sense—a user forgot their password. Not a bug, just expected behavior.
Structure Your Logs
Free-form text logs become useless at scale. Use structured logging (JSON):
```json
{
  "timestamp": "2024-08-08T14:30:15Z",
  "level": "ERROR",
  "service": "AuthService",
  "message": "Failed login attempt",
  "user_id": "12345",
  "email": "user@email.com",
  "reason": "invalid_password",
  "attempt_count": 1
}
```
Structured logs let you query:
- "Show me all errors from AuthService in the last hour"
- "How many failed login attempts did user 12345 have today?"
Logs answer: "What exactly happened?"
Traces: Where Did It Break?
Your single-server app grows into microservices:
- User Service: Authentication and profiles
- Product Service: Product catalog
- Order Service: Processes purchases
- Payment Service: Handles transactions
A user reports: "I can't complete my purchase. The page just hangs."
You check metrics—all services look healthy. You check logs in each service... but which service did the request even touch? How do you follow a single user's journey across multiple services?
Traces Connect the Dots
A trace shows the path of a single request through your distributed system.
When a user clicks "Buy Now":
1. Order Service receives the request and creates a trace ID (e.g., "abc123")
2. It records: "I'm processing order abc123, started at 10:30:15"
3. When it calls User Service, it passes along that trace ID
4. User Service records: "I'm verifying user for trace abc123"
5. When User Service calls Payment Service, it forwards the same trace ID
6. Payment Service records: "I'm charging card for trace abc123"
Each service leaves breadcrumbs connected by the same trace ID. The result:
```
Purchase Request [1,300ms total]
├── Order Service: Process Order [50ms] ✓
├── User Service: Verify User [100ms] ✓
├── Product Service: Check Inventory [150ms] ✓
├── Payment Service: Charge Card [900ms] ⚠️
│   ├── Validate Card [100ms] ✓
│   ├── External Payment Gateway [750ms] ⚠️
│   └── Update Transaction [50ms] ✓
└── Order Service: Finalize Order [100ms] ✓
```
The problem is obvious: 750ms of the request is spent waiting on the external payment gateway.
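Here's a minimal sketch of producing spans like these with the OpenTelemetry SDK for Python (a common choice, not a requirement; the service and span names are illustrative):

```python
# Minimal sketch, assuming `pip install opentelemetry-sdk`.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout; a real setup exports to a backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "abc123")  # illustrative attribute
    with tracer.start_as_current_span("charge_card"):
        pass  # the call to Payment Service would happen here
```

Within one process, child spans pick up the trace ID automatically; to carry it across service boundaries you add a propagator (W3C Trace Context is the default) or use OpenTelemetry's auto-instrumentation for your HTTP framework.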
When to Use Traces
- Debugging latency: Find which service is slow
- Understanding dependencies: See how services interact
- Root cause analysis: Follow a failing request across the stack
- Performance optimization: Identify bottlenecks
Traces answer: "Where in my distributed system did things go wrong?"
Combining All Three: A Real Scenario
Your e-commerce site is struggling during a Black Friday sale.
Step 1: Metrics
Response times are spiking. Error rate is up.
Step 2: Logs
Timeout errors in the Payment Service.
Step 3: Traces
Payment requests take 10+ seconds, but only for orders over $500.
Root cause: The fraud detection system (called by Payment Service) has a bug that makes it extremely slow for high-value transactions.
Without all three, you might have wasted hours checking database connections, server resources, or network issues.
Quick Reference: When to Use What
| Question | Use |
|---|---|
| Is my system healthy? | Metrics |
| What exactly happened? | Logs |
| Where did it break (distributed)? | Traces |
| Should I alert on-call? | Metrics (thresholds) |
| Why did this specific request fail? | Logs + Traces |
| Which service is the bottleneck? | Traces |
Common Mistakes
Metric overload
Don't track everything. Start with what matters to users: latency, error rate, throughput.
Unstructured logs
console.log("error happened") is useless at scale. Add context, use JSON.
Tracing everything
Sample your traces. Capturing 100% of them hurts performance and blows up your storage bill.
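With OpenTelemetry, for instance, head-based sampling is a one-line change (the 10% ratio is just an example; tune it to your traffic):

```python
# Minimal sketch: keep ~10% of traces. ParentBased honors the caller's
# sampling decision so a trace isn't half-recorded across services.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
```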
Using them in isolation
Each pillar is useful alone. Together, they're powerful. Correlate metric spikes with log errors and trace timelines.
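The easiest way to enable that correlation is to stamp the active trace ID onto every log line. Here's a sketch combining the earlier structlog and OpenTelemetry examples (both of which were my assumptions, not requirements):

```python
# Minimal sketch: attach the current trace ID to a structured log line
# so your backend can join logs and traces for the same request.
from opentelemetry import trace
import structlog

log = structlog.get_logger()

def log_error(event: str, **fields):
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        fields["trace_id"] = format(ctx.trace_id, "032x")  # 32-char hex ID
    log.error(event, **fields)

log_error("Payment gateway timeout", service="PaymentService")
```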
Getting Started
If you're not sure where to begin:
- Start with metrics for the RED method: Rate (requests/sec), Errors (error rate), Duration (latency)
- Add structured logging to your services
- Implement tracing when you have 2+ services communicating
Sign up for a 14-day free OpenObserve Cloud trial and integrate your metrics, logs, and traces into one powerful platform to boost your operational efficiency and enable smarter, faster decision-making.
What's your observability setup? Are you using all three pillars, or still figuring out where to start? Let me know in the comments 👇