Ranga Bashyam G

Observability: My New Experience and Beyond

From an AI/ML Background...

In this article, I’m jotting down my journey from being an AI engineer, living deep in models, data, and drift, to stepping into the world of observability. I’m not going too deep into the transition itself, but more into what I’ve learned about observability: its purpose, where it fits, how to use it, and the stack that actually makes sense in real-world engineering.

If you come from AI or ML, you probably think you get monitoring. We keep an eye on pipelines, stare at dashboards, track every metric we can get our hands on. We’re obsessed with recall, precision, AUC, all those numbers that tell us if the model’s still alive. In MLOps, it’s all about performance. Is the model still making sense in the real world? Should we retrain? When do we push the next checkpoint? And, most importantly, how do we do it without breaking everything for users? That’s the game: ship the next version, quietly, while everyone keeps moving along like nothing happened.

But in modern cloud systems, what we track and how we track it is a completely different beast.

Today we have scrapers, agents, fetchers, exporters, service meshes, sidecars, dashboards, all trying to answer one single question: “What’s happening?”
The irony? Teams pour time and money into these tools and still don’t have a real handle on their systems. I’ve seen organizations with the best monitoring stack, tons of fancy dashboards, and still nobody knows what’s actually going wrong when something breaks. It’s frustrating because most of these setups tell you things are happening, not why they are happening.

That’s where the transition hit me hard.

ML monitoring is narrow, purpose-driven.
Cloud observability is wide, chaotic, systemic.

In AI/ML, the model gets all the attention. It’s the prize everyone’s guarding, and most of our work goes into making sure it stays useful. We keep an eye on data pipelines so nothing goes stale and check whether predictions still match what we saw during training or in a local environment. And it’s not just classical ML models: even when we deploy a RAG pipeline or call an LLM, we mostly watch the actual outputs and the cost, and not much else.

But observability?
It’s not about one component. It’s about everything: every request, every microservice, every hop, every node, every storage layer, every unexpected side effect in your system.

That shift changed how I saw things. I stopped being the person obsessed with just the model and started seeing the whole system as this messy, living thing. Once you dive into observability, you stop asking, “Is it up?” You start asking, “When it goes down, how will I know why?”

And that’s the foundation of this article.

Observability Isn’t Just Dashboards, It’s How You Keep Your Head Above Water in Production

If you’ve spent any time wrangling production systems, you know the drill. The dashboards look perfect, everything’s good, CPU and memory numbers are steady, and the services say they’re “healthy.” Then out of nowhere, users start yelling, latency spikes, and suddenly the business is losing money. That’s when you get it: observability isn’t just another layer on top of monitoring. It’s what keeps you from drowning when things go sideways.

Introducing Signals: Golden Signals, LEST, and the Power of Percentiles

Let’s be real: we engineers get attached to averages, but users? They notice the outliers, the rough edges, the long wait times, the weird glitches. Say your P99 latency suddenly jumps from 110ms to four seconds. The average still looks fine, but users are losing their patience. That’s why you need to nail the Golden Signals: Latency, Errors, Traffic, Saturation. They might sound boring, but they’re the backbone of almost every incident. Track latency spikes with traces, hunt down errors in logs using correlation IDs, figure out whether traffic bursts are real people, retries, or just bots, and for saturation, don’t just trust pretty dashboards; check your queues and throttles.
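To make the percentile point concrete, here’s a tiny, hypothetical Python sketch: a batch of request latencies where the average barely moves while the P99 tells the real story. The numbers are made up for illustration.

```python
import random
import statistics

# Hypothetical latency samples in seconds: most requests sit around 110 ms,
# but roughly 2% get stuck behind a slow dependency.
random.seed(7)
latencies = [random.uniform(0.08, 0.14) for _ in range(980)]   # the happy path
latencies += [random.uniform(3.5, 4.5) for _ in range(20)]     # the unlucky tail

mean = statistics.mean(latencies)
p99 = statistics.quantiles(latencies, n=100)[98]   # the 99th-percentile cut point

print(f"mean latency: {mean * 1000:.0f} ms")   # barely moves (~190 ms)
print(f"p99 latency:  {p99 * 1000:.0f} ms")    # what the tail actually feels (~4 s)
```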

Then there’s LEST: Logs, Events, Spans, Traces. This is where engineers really get their hands dirty. Logs tell the story, great for debugging and post-mortems. Events flag the big moments. Spans break down what’s happening inside those complex, distributed requests. Traces show you the whole system in motion. Think of metrics as the rough sketch, logs as the details, traces as the journey, and events as the why behind it all. When you pull these threads together, troubleshooting stops feeling like digging through rubble and starts feeling like solving a mystery with all the right clues.
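To show how those signal types relate in code, here’s a hedged sketch using the OpenTelemetry Python SDK (assuming the opentelemetry-sdk package is installed, with a console exporter so it runs standalone). The span is the unit of work, the span event marks a notable moment, the trace ID ties the hops together, and the log line carries that same ID as a correlation ID. Names like `checkout` and `order_id` are placeholders.

```python
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Minimal tracer setup: print finished spans to stdout instead of shipping them anywhere.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout-service")

with tracer.start_as_current_span("checkout") as span:            # Span: one unit of work
    span.set_attribute("order_id", "ord-12345")                   # placeholder attribute
    span.add_event("payment.authorized", {"provider": "example"}) # Event: a notable moment
    trace_id = f"{span.get_span_context().trace_id:032x}"         # Trace: ties all the hops together
    log.info("checkout completed correlation_id=%s", trace_id)    # Log: the narrative, linked by ID
```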

The Pillars of Service: Reliability Goes Beyond Metrics

Here’s something you pick up fast in the trenches: a lot of teams get obsessed with flashy dashboards and think that’s all observability is. The real pros? They see observability as the backbone of reliability engineering.

Let’s break it down. First, there’s Availability. SLAs, SLOs, and SLIs get thrown around a lot, but they’re not just corporate jargon. They’re what help you actually manage pain: your pain, the users’ pain, everyone’s pain. If your on-call folks wake up every other night, your metrics are lying to you. SLOs force you to pay attention to what users really feel, not just what looks pretty on a screen.
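Roughly speaking, an SLI is what you measure (say, the fraction of successful requests), an SLO is the target you hold yourself to, and an SLA is the contract with consequences. Here’s a small sketch of the arithmetic that turns those into an error budget; the numbers are entirely hypothetical.

```python
# Hypothetical numbers for a 30-day window.
total_requests = 48_000_000
failed_requests = 31_000

slo_target = 0.999                                   # "three nines" availability objective

sli = 1 - failed_requests / total_requests           # what users actually experienced
error_budget = (1 - slo_target) * total_requests     # failures we were allowed this window
budget_burned = failed_requests / error_budget       # > 1.0 means the SLO is blown

print(f"SLI: {sli:.5f}")                             # ~0.99935, still above the objective
print(f"Error budget burned: {budget_burned:.0%}")   # ~65% of the window's budget gone
print("Page someone" if budget_burned > 1 else "Breathe, but watch the burn rate")
```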

Then you’ve got Performance. Everyone loves a good average, right? But the real problems, the ones that make users start cursing your name, hide in those nasty outliers: P95, P99 latencies, all that. That’s the stuff that makes or breaks user experience.

Last up, Reliability. Reliable systems aren’t the ones that never break. They’re the ones that break in obvious, contained ways, and recover fast. That’s what strong engineering looks like when it’s actually running in production.

Building the Observability Stack

When you get telemetry right, you stop guessing and actually start solving problems. This isn’t about collecting every tool out there; it’s about how they work together when you’re on-call and things are going sideways. Prometheus is your go-to for metrics, grabbing data from exporters all over the place. Just be careful with labels: if you use high-cardinality values like user IDs, UUIDs, or timestamps, Prometheus will slow to a crawl.
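As a sketch of what “careful with labels” looks like, here’s a minimal example with the prometheus_client Python library (assuming it’s installed): every label has a small, predictable set of values, and anything unbounded like a user ID belongs in logs or traces, not in a metric label. The metric and route names are placeholders.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Good: bounded labels with a small, predictable set of values.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "route", "status"],        # e.g. GET /orders 200
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["route"],
)

# Bad (don't do this): Counter("...", "...", ["user_id", "request_uuid"])
# Every unique value creates a new time series, and Prometheus grinds to a crawl.

def handle_request(method: str, route: str) -> None:
    with LATENCY.labels(route=route).time():      # records the duration automatically
        status = "200"                            # ... real handler work goes here ...
        REQUESTS.labels(method=method, route=route, status=status).inc()

start_http_server(8000)                           # exposes /metrics for Prometheus to scrape
handle_request("GET", "/orders")
```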

Grafana’s the dashboard you actually want to look at. If you’ve got 20 panels jammed in, it’s basically a screensaver, not something that helps you in a pinch. Stick to what matters: error rates, latency percentiles, traffic spikes, and how close your infrastructure is to maxing out. That’s what keeps you afloat.

Loki’s great for logs and won’t destroy your budget. Think of logs as structured stories with correlation IDs: you want to connect the dots, not drown in endless lines of noise (and definitely not rack up a monster cloud bill).
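Here’s a rough idea of what a “structured story” can look like, sketched with the Python standard library (Loki just ingests whatever your log agent ships, so the point here is only the log shape): one JSON object per line, with a correlation_id you can filter on later. The service name and fields are illustrative.

```python
import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, easy to query later."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": "orders-api",                              # placeholder service name
            "msg": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders-api")
log.addHandler(handler)
log.setLevel(logging.INFO)

# One ID per request; pass it to every downstream call and every log line.
correlation_id = str(uuid.uuid4())
log.info("order received", extra={"correlation_id": correlation_id})
log.info("payment confirmed", extra={"correlation_id": correlation_id})
```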

Once your setup grows, Mimir comes in handy with multi-tenancy, long-term storage, and distributed metrics. Suddenly, keeping data around isn’t just a financial headache, it’s a feature.

Tracing with OpenTelemetry instrumentation and a backend like Tempo gives you superpowers. When calls between services start dragging, a trace tells you exactly where things are stuck, like spotting Service B endlessly retrying because Redis is timing out at just 10% saturation. Finding details like that can save you hours when chaos hits.
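To picture what that trace looks like, here’s a hedged sketch reusing the OpenTelemetry setup from the earlier snippet, with the Redis call faked by a sleep: the parent span covers the request and each retry becomes its own child span, so the waterfall shows exactly where the time went instead of leaving you to guess. Service and attribute names are made up.

```python
import time

from opentelemetry import trace

# Assumes a TracerProvider is already configured (see the earlier snippet).
tracer = trace.get_tracer("service-b")

def call_redis_with_retries(key: str, attempts: int = 3) -> None:
    """Fake a flaky cache lookup; each attempt becomes a child span in the trace."""
    for attempt in range(1, attempts + 1):
        with tracer.start_as_current_span("redis.get") as span:
            span.set_attribute("db.system", "redis")
            span.set_attribute("retry.attempt", attempt)
            time.sleep(0.5)                              # stand-in for a call that times out
            span.add_event("timeout", {"timeout_ms": 500})
    # In a trace viewer, three back-to-back 500 ms children under one parent
    # span is the smoking gun: the latency lives in the retries.

with tracer.start_as_current_span("GET /profile"):       # the request users are waiting on
    call_redis_with_retries("profile:42")
```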

Observability as a Culture

Observability isn’t all shiny dashboards and smooth graphs. It’s messy. Every engineer finds out the hard way. Outages? They almost never wave a flag; you have to dig. High cardinality? That’ll wreck your clusters long before you run out of CPU. Logging everything sounds smart, but honestly, it just creates a pile of noise. Alert fatigue wears you down fast. Skip correlation IDs and you can forget about real debugging. And if you’re not paying attention, vendors will eat your budget for breakfast. Even dashboards can get out of hand; sometimes they end up as vanity projects, not real tools. The worst? A gorgeous dashboard that goes silent when everything’s burning.

Here’s the truth: observability isn’t just about tech. It’s about how you work. If developers don’t instrument their code, operations spends its days putting out fires. Good telemetry starts with devs: emit the right metrics, keep logs structured, use span contexts, skip random labels, and actually respect retention policies. Blame-free post-mortems matter. Alerts should make sense. SLOs should match what users actually care about. A solid system isn’t one that never breaks. It’s one that tells you, loud and clear, when it does.

Wrapping It Up

Observability isn't a checkbox or "slap on Grafana and done." It's a discipline that flips incidents into lessons, mess into method, and dashboards into honest mirrors. Every engineer learns this eventually, often painfully: healthy metrics don't guarantee a healthy system. Get that, and you shift from prettifying screens to crafting systems that talk back meaningfully.
