di(nara) critskaya

When Is Monitoring Enough? A Practical Guide to Database Observability

The Question That Started It All

It all began with a simple LinkedIn post titled "When is Monitoring Enough?" while I was building a Grafana dashboard for Apache Ignite. The question hit me like a lightning bolt: When is enough, enough?

We live in a world where databases demand robust monitoring, and countless tools promise to make our lives easier. Yet in practice, these tools often become part of the problem. They're either excessive for our needs, missing critical features, or simply add to the noise rather than providing clarity.

Meanwhile, business requirements push us toward ever-increasing system complexity. We end up in one of two extremes: drowning in monitoring data we can't use, or flying blind without the observability we desperately need.

So the riddle remains: How much monitoring is truly enough?

Monitoring Isn't as Simple as It Looks

You might hand it over to your infrastructure engineers or DBREs, thinking they will sort it all out and the case is closed. But here's the problem: if you don't understand what you're looking at, how can you find the root cause of performance issues?

Installing hundreds of dashboards doesn't magically solve your problems. Without understanding, monitoring becomes a maze rather than a map.

We've encountered a dozen scenarios like this one:

  1. Critical issue strikes
  2. Alerts fire everywhere - but which one matters?
  3. Dashboards show data - but what does it mean?
  4. Hours pass while you hunt for the real problem
  5. Knowledge gaps turn a 5-minute fix into a nightmare

The harsh truth: An inadequate monitoring system can be worse than no monitoring at all. It gives false confidence while hiding real problems, and that false confidence steers us away from the root cause.

When it comes to databases, monitoring isn't just about collecting metrics - it's about understanding what to monitor. Your ops engineers might integrate thousands of dashboards and hundreds of alerts, but if they don't match your needs, the whole system just turns into a burden.

The real question becomes: "How do we build a monitoring system that actually helps?"

The Abundance Trap

Having too many options can paralyze decision-making. When everything seems important, nothing is. You face critical questions:

  • Which metrics actually matter for YOUR use case?
  • How do you interpret these metrics correctly?
  • What thresholds indicate real problems vs. normal fluctuations?

The abundance that should empower you instead creates confusion during critical moments - exactly when you need clarity most.
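
The third question, at least, can be made concrete. One common approach (a sketch under assumptions, not a prescription) is to derive thresholds from a metric's own recent history instead of picking a fixed number, so that only values far outside the observed baseline count as real problems. The window size and the three-sigma cut-off below are illustrative choices:

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """Flag `current` only when it falls far outside the recent baseline.

    `history` is a rolling window of recent samples (say, the last hour of
    query latencies); `sigmas` is an illustrative cut-off, not a standard.
    """
    if len(history) < 30:  # too little data to say what "normal" looks like
        return False
    baseline, spread = mean(history), stdev(history)
    return abs(current - baseline) > sigmas * max(spread, 1e-9)

# A spike well outside a steady ~12 ms baseline vs. an ordinary wobble.
window = [12.0, 11.5, 12.3, 11.8] * 10
print(is_anomalous(window, 45.0))   # True  - likely a real problem
print(is_anomalous(window, 12.4))   # False - normal fluctuation
```

Whether three sigmas (or a static threshold, or a percentile) is right depends entirely on how the metric behaves in your system - which is exactly why these interpretation questions have no universal answer.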

Separating Business Needs from Technical Noise

I once worked with a team of developers who had imported pre-built Grafana dashboards. They didn't understand half of what they were looking at, or where to look. I must say the dashboards were impressive, but at second glance I realized they hadn't been built for this team and didn't help the developers solve their problems. When problems hit, the team spent more time trying to decode the dashboards than fixing the issues.

Through painful experience I learned that most monitoring fails not because it shows too little, but because it shows too much. Every additional metric, graph, or alert adds cognitive weight. A radical idea took hold: start by removing, not adding.

The dashboard (and its metrics) must achieve three simple things:

  • Simplicity - we can understand what we are looking at;
  • Clarity - an unmistakable yet plain message, like "Low memory", not "OOM: Out of memory exception";
  • Actionability - it tells you what to do next and prompts a clear response (see the sketch after this list).
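
To make those three criteria concrete, here is a minimal sketch of an alert check that tries to pass all three. The metric names, the 10% threshold, and the wording are assumptions for illustration, not values from any real Ignite or Grafana setup:

```python
from typing import Optional

def check_data_node_memory(free_bytes: int, total_bytes: int) -> Optional[str]:
    """Return one plain, actionable message, or None when everything is fine.

    The threshold and the message text are illustrative; tune both to your system.
    """
    free_ratio = free_bytes / total_bytes
    if free_ratio < 0.10:  # hypothetical threshold: less than 10% memory free
        return (
            f"Low memory: only {free_ratio:.0%} free on the data node. "
            "Next step: check heap and off-heap usage, then follow the "
            "low-memory runbook."
        )
    return None  # no message means no action needed

# The on-call engineer gets a message they can act on, not a stack trace to decode.
print(check_data_node_memory(free_bytes=6 * 1024**3, total_bytes=64 * 1024**3))
```

The point isn't the code - it's that the alert names the problem in plain language and tells the reader what to do next.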

By the end of the refactoring I had reached a conclusion: simplicity isn't dumbing down - it's smartening up.

Manage Monitoring Complexity and Build Your Own Philosophy

Research in software engineering shows that our brains can only handle limited complexity. Cognitive load has been widely studied to help understand human performance. When monitoring becomes too complex, it stops helping and starts hurting.

Consider these cognitive limits:

  • Working memory: Can hold 5-9 items at once
  • Attention span: Drops significantly after 20 minutes
  • Context switching: Each switch costs 15-25 minutes of productivity

You also shouldn't ignore the impact of "why" questions on the business. By asking questions such as "why do we need this?" and "why monitor performance?", you can work out flexible principles that become your own monitoring philosophy:

  • Monitor for decisions, not data - every metric should point to a potential action (see the sketch after this list);
  • Respect human limitations - keep the obvious in plain sight;
  • Embrace imperfection - enough monitoring that people use beats perfect monitoring they don't;
  • Iterate relentlessly - start simple and expand based on actual incidents.
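
As a sketch of the first principle (it also anticipates the closing question below), you can treat your metric catalog as data and audit it: every metric has to name the decision or action it supports, and anything that doesn't is a candidate for removal. The metric names and actions here are made-up examples:

```python
# Hypothetical catalog: metric -> the action someone would take if it changed.
catalog = {
    "ignite.node.heap_used_ratio": "Scale the cluster or tune eviction policy",
    "ignite.query.p99_latency_ms": "Investigate slow queries, review indexes",
    "ignite.cache.gets_per_second": "",  # nobody acts on this -> candidate to drop
}

def audit(catalog: dict[str, str]) -> list[str]:
    """Return the metrics nobody would act on if they changed."""
    return [metric for metric, action in catalog.items() if not action.strip()]

for metric in audit(catalog):
    print(f"'{metric}' has no decision attached - consider removing it.")
```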

Conclusion

After building monitoring systems for various databases and learning from actual incidents, I've come to believe that the question isn't "How much can we monitor?" but rather "What monitoring serves our actual needs?"

Like a car dashboard that shows speed, fuel, and warnings - not every detail of engine operation - your database monitoring should surface what matters for safe operation, not everything that's technically possible.

Remember: The goal isn't to monitor everything. It's to understand your system well enough to monitor the right things.

Start small. Stay focused. Iterate based on reality, not anxiety.

And most importantly, always ask yourself: "If this metric changes, what will I do differently?"

If the answer is "nothing," you've found a metric you don't need.
