Chisom Chima

Observability vs Monitoring

Here is a situation that has probably happened to you at some point.

A user submits a support ticket saying the app is "slow." You check your dashboards. CPU looks fine. Memory looks fine. Error rate is zero. Everything is green. But the user is right, something is wrong. You just cannot see what it is.

So you start guessing. Maybe it's the database. Maybe it's a third-party API. Maybe it's that new endpoint that shipped last week. You add some console.log statements, redeploy, wait for it to happen again, and hope the logs tell you something useful.

That is what life looks like without observability. And honestly, it is the reality for most engineering teams, even the ones that think they have "good monitoring."

The Word Everyone Gets Wrong

Monitoring and observability get used as synonyms all the time, even by experienced engineers who should know better. They are related, but they describe fundamentally different things, and mixing them up leads to blind spots that cost you hours of painful debugging.

The clearest way to think about it:

Monitoring tells you that something is wrong.

Observability tells you why.

Monitoring is about watching predefined metrics like CPU usage, memory, request counts, and error rates, then alerting you when one of them crosses a threshold you set in advance. It works great for problems you already know about and thought to measure ahead of time.
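To make that concrete, a monitoring check boils down to a predicate over a metric someone chose ahead of time. Here is a minimal TypeScript sketch; the metric names and thresholds are invented for illustration, not taken from any particular tool:

// Minimal sketch of threshold-based monitoring. Metric names and
// thresholds are invented for illustration.
interface MetricSample {
  name: string;   // e.g. "cpu_usage_percent"
  value: number;  // latest observed value
}

interface AlertRule {
  metric: string;
  threshold: number;
  message: string;
}

const rules: AlertRule[] = [
  { metric: "cpu_usage_percent", threshold: 90, message: "CPU above 90%" },
  { metric: "error_rate_percent", threshold: 1, message: "Error rate above 1%" },
];

function evaluate(samples: MetricSample[], rules: AlertRule[]): string[] {
  // Fires only for conditions someone predicted and wrote a rule for.
  return rules
    .filter((rule) => {
      const sample = samples.find((s) => s.name === rule.metric);
      return sample !== undefined && sample.value > rule.threshold;
    })
    .map((rule) => rule.message);
}

// A slow-but-not-erroring request pattern trips none of these rules.
console.log(evaluate([{ name: "cpu_usage_percent", value: 42 }], rules)); // []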

Observability is about the ability to understand the internal state of a system just by looking at the data it produces. It handles the problems you did not predict and never wrote an alert for. The ones that only appear on Tuesdays at 3pm for users in a specific region making a very specific sequence of requests.

The key word in that definition is ability. Observability is not a tool you install. It is a property your system either has or does not have, and building it requires intentional decisions at every layer of your stack.

Why Monitoring Alone Is Not Enough

Imagine your house has a single smoke detector in the hallway. If there is a fire, it goes off. Great, that is monitoring. You know something is wrong.

Now imagine the fire is in the kitchen. Or in the basement. Or it is not actually a fire but a slow gas leak that has not ignited yet. The smoke detector cannot help you understand any of that. It only knows one thing: smoke threshold crossed, yes or no.

Most monitoring setups work exactly like this. They answer binary questions about things you already anticipated.

The problem with modern software is that most of the interesting failures are not binary and were not anticipated. A service call that normally takes 80 milliseconds starts taking 800 milliseconds for about 3% of requests. No error is thrown. No threshold is breached. Users just notice the app feels sluggish on certain actions, and they start quietly switching to a competitor.

Traditional monitoring has nothing to say about this. Observability does.

The Three Pillars (And What They Actually Mean)

You will hear observability described through three data types: logs, metrics, and traces. Most articles just list them and move on. I want to actually explain what each one does and why you need all three working together.

Logs

Logs are the oldest and most familiar tool. They are records of things that happened, written to a file or a stream as your application runs.

2026-04-18T14:22:31Z INFO  User 8821 requested /orders/summary
2026-04-18T14:22:31Z INFO  Fetching orders from database
2026-04-18T14:22:32Z WARN  Database query took 943ms (threshold: 500ms)
2026-04-18T14:22:32Z INFO  Returned 12 orders to user 8821

Good logs tell you what happened, in what order, and with enough context to reconstruct the sequence of events. Bad logs tell you almost nothing useful: messages like "Error occurred" or "Request failed" with no indication of which request, which user, or what the actual error was.

The problem with logs alone is that they become overwhelming fast. A service handling a few thousand requests per second might produce millions of log lines per hour. Finding the one that explains your bug feels like searching for a specific sentence in a library with no catalogue system.

Metrics

Metrics are numerical measurements collected over time. Unlike logs, which capture individual events, metrics are aggregated. They answer questions like:

  • What is the average response time over the last five minutes?
  • How many requests per second are hitting this endpoint?
  • What percentage of database connections are currently in use?
  • How many items are sitting in a queue waiting to be processed?

Metrics are excellent for spotting trends and triggering alerts. They are also cheap to store because a single number replaces thousands of individual log lines. The tradeoff is that aggregation destroys detail. If your average response time is 200ms but your 99th percentile sits at 4 seconds, the average makes everything look fine while a real slice of your users are having a genuinely terrible experience.
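To see how much an average can hide, here is a small TypeScript sketch with made-up latency numbers:

// Sketch: how an average can look healthy while the tail is terrible.
// The latency values below are made up for illustration.
function percentile(sortedMs: number[], p: number): number {
  const index = Math.min(
    sortedMs.length - 1,
    Math.ceil((p / 100) * sortedMs.length) - 1
  );
  return sortedMs[Math.max(0, index)];
}

// 97 fast requests and 3 very slow ones.
const latenciesMs = [
  ...Array.from({ length: 97 }, () => 120),
  4000, 4200, 4500,
].sort((a, b) => a - b);

const average =
  latenciesMs.reduce((sum, v) => sum + v, 0) / latenciesMs.length;

console.log(`average: ${average.toFixed(0)}ms`);      // ~243ms, looks fine
console.log(`p99: ${percentile(latenciesMs, 99)}ms`); // 4200ms, not fine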

Traces

Traces are the pillar that most teams skip entirely, and they are often the most valuable one for debugging anything in a distributed system.

A trace follows a single request as it travels through your entire system: from the browser, through your API gateway, into Service A, which calls Service B, which queries the database, which calls an external payment API, and eventually returns a response to the user.

Each step in that journey is called a span. The trace is the collection of all spans for one request, tied together with a shared identifier called a trace ID.

Here is what a trace might look like for a checkout request:

Trace ID: a3f8b21c

[0ms]     API Gateway          received POST /checkout
[2ms]     Auth Service         validated JWT token         (2ms)
[4ms]     Cart Service         fetched cart for user 8821  (18ms)
[22ms]    Inventory Service    checked stock availability  (340ms)  <- slow
[362ms]   Payment Service      charged card               (89ms)
[451ms]   Order Service        created order record       (12ms)
[463ms]   API Gateway          returned 200 OK

In one view, you can see the Inventory Service took 340 milliseconds, which is where nearly all the latency for this request lived. Without distributed tracing, you would have to correlate timestamps across four separate log files to figure that out, assuming the relevant logs even existed in the first place.
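For a sense of what emitting those spans looks like in application code, here is a rough sketch using the OpenTelemetry JavaScript API. The span names, attributes, and the queryStockFromDb helper are invented for illustration, and it assumes an SDK and exporter are configured elsewhere:

// Rough sketch of emitting a span with the OpenTelemetry API.
// Span names, attributes, and the database helper are invented.
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("inventory-service");

async function checkStock(sku: string): Promise<boolean> {
  // startActiveSpan makes this span the parent of anything created inside it,
  // so every step of the call chain ends up sharing one trace ID.
  return tracer.startActiveSpan("inventory.check_stock", async (span) => {
    try {
      span.setAttribute("inventory.sku", sku);
      const inStock = await queryStockFromDb(sku); // hypothetical helper
      span.setAttribute("inventory.in_stock", inStock);
      return inStock;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

// Hypothetical stand-in for a real database query.
async function queryStockFromDb(sku: string): Promise<boolean> {
  return sku.length > 0;
}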

A Real Debugging Scenario

Say you get a Slack alert at 9am: "P95 checkout latency spiked to 4 seconds, normally 600ms." You have about two minutes before your on-call phone starts ringing.

With only monitoring:
You know something is wrong. You open your dashboards. CPU fine, memory fine, error rate zero. You start guessing which service is the culprit and dig through logs hoping something jumps out. Twenty minutes later, maybe you find it.

With full observability:

First, you open your tracing tool and filter for slow checkout requests from the last fifteen minutes. You immediately see that every slow trace shares one thing in common: the Inventory Service span is taking 2-3 seconds instead of the usual 50ms.

Then you click into one of those slow traces and look at the logs attached to that specific span. You see: Inventory cache MISS - falling back to database query.

Next, you check your metrics dashboard for the Inventory Service cache hit rate. It dropped from 94% to 11% at 8:51am, right when the latency started climbing.

Finally, you check what changed at 8:51am. A deployment went out. Someone updated the cache key format, which silently invalidated every cached item in one shot.

Total time from alert to root cause: four minutes. That is what observability actually looks like when it is set up properly.

The Difference in One Sentence

Monitoring answers a question you already thought to ask. Observability lets you ask questions you had not imagined yet.

This distinction matters more than ever because modern systems are not monoliths anymore. A single user action might touch ten services, three databases, two message queues, and a couple of third-party APIs. When something goes wrong in that web of interactions, you cannot possibly have written an alert for every failure mode in advance. You need the ability to explore and investigate freely.

Structured Logging: The Underrated Starting Point

Before reaching for a fancy observability platform, the most impactful thing most teams can do is improve their logs by making them structured.

Unstructured log:

User 8821 checkout failed after 3.2s

Structured log (JSON):

{
  "timestamp": "2026-04-18T09:14:22Z",
  "level": "error",
  "event": "checkout_failed",
  "user_id": 8821,
  "duration_ms": 3200,
  "trace_id": "a3f8b21c",
  "failed_service": "inventory",
  "error_code": "CACHE_MISS_TIMEOUT"
}

The structured version is searchable, filterable, and joinable with other records. You can ask your logging system to show you all checkout failures in the last hour where failed_service is inventory. With unstructured logs, you are doing regex searches and crossing your fingers.

Most modern logging libraries support structured output out of the box. Turning it on is usually a single configuration change.
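As one example, pino in Node.js emits JSON by default. This is just a sketch; the field names mirror the example above rather than any required schema:

// Sketch of structured logging with pino, which emits JSON by default.
// Field names mirror the example above and are illustrative, not a schema.
import pino from "pino";

const logger = pino({ level: "info" });

logger.error(
  {
    event: "checkout_failed",
    user_id: 8821,
    duration_ms: 3200,
    trace_id: "a3f8b21c",
    failed_service: "inventory",
    error_code: "CACHE_MISS_TIMEOUT",
  },
  "checkout failed"
);
// => {"level":50,"time":...,"event":"checkout_failed","user_id":8821,...,"msg":"checkout failed"}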

Tools Worth Knowing About

You do not need to build any of this from scratch. The ecosystem has matured a lot in the last few years.

For metrics, Prometheus is the open-source standard. It scrapes numeric measurements from your services and stores them as time-series data. Pair it with Grafana to build dashboards and set alerts, and you have a solid foundation that thousands of companies run in production today.
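If your services run on Node.js, the prom-client library is a common way to expose metrics for Prometheus to scrape. A minimal sketch, with invented metric and label names:

// Minimal sketch of exposing a request-duration histogram with prom-client.
// Metric names, labels, and buckets are invented for illustration.
import http from "node:http";
import client from "prom-client";

const requestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Request duration in seconds",
  labelNames: ["route", "status"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

// Prometheus scrapes this endpoint on an interval you configure.
http
  .createServer(async (req, res) => {
    if (req.url === "/metrics") {
      res.setHeader("Content-Type", client.register.contentType);
      res.end(await client.register.metrics());
      return;
    }
    const end = requestDuration.startTimer({ route: req.url ?? "unknown" });
    res.end("ok");
    end({ status: "200" });
  })
  .listen(3000);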

For distributed tracing, OpenTelemetry is the project that matters most right now. It is an open standard with vendor-neutral instrumentation libraries that you add to your services to emit traces, metrics, and logs in a consistent format. Once you instrument your services with OpenTelemetry, you can send that data to whichever backend you prefer. Jaeger is open source and great for getting started. Tempo from Grafana integrates cleanly with the rest of that stack. Managed services like Honeycomb or Datadog are solid options if you want less operational overhead.
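Bootstrapping it in a Node.js service typically looks something like the sketch below. Package names and options shift between OpenTelemetry versions, and the localhost OTLP endpoint is only an assumed default, so treat this as a starting point rather than a recipe:

// Sketch of bootstrapping OpenTelemetry tracing in a Node.js service and
// exporting to an OTLP-compatible backend (Jaeger, Tempo, Honeycomb, ...).
// Package names and options vary between versions; verify against the docs.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "checkout-service",
  // Auto-instrumentation attaches spans to common libraries (http, pg, ...).
  instrumentations: [getNodeAutoInstrumentations()],
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces", // assumed local OTLP endpoint
  }),
});

sdk.start();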

For logs, Loki from Grafana is a lightweight option that plays nicely with the rest of that ecosystem. If you are already running the ELK stack (Elasticsearch, Logstash, Kibana), that works well too, though it is heavier to maintain long-term.

The full Grafana stack, meaning Prometheus, Loki, Tempo, and Grafana together, gives you all three pillars in a single unified interface and is entirely free to self-host. For most teams just getting started, this is the most practical path forward.

When Monitoring Is Actually the Right Tool

This is worth being clear about: monitoring is not obsolete. It is still the right tool for predictable, well-understood failure modes.

Is the service up or down? Monitoring. Set an alert and move on.

Is database storage above 85%? Monitoring. Simple threshold, simple alert.

Is the TLS certificate expiring in the next seven days? Monitoring. Done.

Observability earns its added complexity when you have distributed systems, when failure modes are unpredictable, and when the cost of long debugging sessions is high. A solo developer building a side project probably does not need distributed tracing. An engineering team running twenty microservices and handling millions of users almost certainly does.

The Mindset Shift

The deeper change that observability asks for is not really technical at its core. It is about how you think about your systems.

With a monitoring mindset, you assume you know what can go wrong and you write alerts for it. You are reactive to events you already predicted.

With an observability mindset, you accept that you cannot predict everything. So instead, you invest in making your system explorable. When something unexpected happens, you have enough data to reason about it from the outside, without needing to reproduce it locally or add new instrumentation after the fact and wait for the bug to resurface.

This shift sometimes gets described as moving from handling known unknowns to handling unknown unknowns. Monitoring covers what you know you do not know. Observability covers what you had no idea you did not know.

Production systems fail in genuinely creative ways. The more complex your architecture, the more creative those failures get. Knowing a server is down is easy. Understanding why 3% of users experience five-second delays on a Tuesday afternoon after making a specific sequence of requests that nobody thought to test together requires observability.

You do not have to build it all at once. Start with structured logs. Add metrics with Prometheus. Instrument one or two critical paths with OpenTelemetry traces. Each layer gives you more signal, and more signal means shorter debugging sessions and faster fixes.

That is the whole point. Not the dashboards, not the tools. The shorter debugging sessions.
