I've built dozens of dashboards. Most have been ignored. A few have been used constantly. The difference isn't the graphs. It's the design.
The 3-second test
A useful dashboard answers 'is everything OK?' in 3 seconds. Not 'let me scroll through 40 graphs to find out.'
Big colored header at the top: green = healthy, yellow = watching, red = broken. That's the 3-second answer. Everything else is drill-down.
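The traffic-light rollup above is simple enough to sketch in a few lines. This is a minimal illustration, not any particular tool's API; the service names and thresholds are made up.

```python
# Severity order for the traffic-light rollup: green < yellow < red.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def classify(error_rate, warn=0.01, crit=0.05):
    """Map a single SLI (here, error rate) to a color.
    Thresholds are illustrative placeholders."""
    if error_rate >= crit:
        return "red"
    if error_rate >= warn:
        return "yellow"
    return "green"

def overall_status(service_statuses):
    """The 3-second answer: the worst status across all services."""
    return max(service_statuses.values(), key=lambda s: SEVERITY[s])

# Hypothetical services feeding the header.
statuses = {
    "api": classify(0.002),       # green
    "db": classify(0.02),         # yellow
    "payments": classify(0.001),  # green
}
print(overall_status(statuses))  # -> yellow
```

Worst-of is the right aggregation here: one yellow service should turn the whole header yellow, because the header's only job is to answer "is everything OK?"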
The hierarchy rule
Three layers, no more:
- Overview: one line per service, status color, key SLI
- Service detail: one dashboard per service, 6-12 graphs max
- Deep dive: triggered from service detail, domain-specific
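The three layers can be modeled as plain data, each level linking down to the next. A hedged sketch with hypothetical service and dashboard names, including a check for the 6-12 graph budget:

```python
# Layer 1: overview — one line per service, linking to its detail dashboard.
overview = {
    "api":      {"status": "green",  "sli": "p99 latency", "drilldown": "dash/api"},
    "payments": {"status": "yellow", "sli": "error rate",  "drilldown": "dash/payments"},
}

# Layer 2: service detail — a handful of graphs plus links to deep dives.
service_detail = {
    "dash/payments": {
        "graphs": ["error rate", "p99 latency", "saturation",
                   "db latency", "queue depth", "deploy markers"],
        "deep_dives": ["dash/payments/db", "dash/payments/queue"],  # layer 3
    },
}

def graph_budget_ok(dashboard):
    """Enforce the 6-12 graph budget on a service-detail dashboard."""
    return 6 <= len(dashboard["graphs"]) <= 12

print(graph_budget_ok(service_detail["dash/payments"]))  # -> True
```

The point of making the hierarchy explicit is that every dashboard knows exactly one parent and its drill-down targets, so there is nowhere to get lost.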
Anything beyond 3 layers is 'please get lost in my dashboard tree.'
The on-call test
Imagine you're on-call at 3 AM. You get paged for 'service X is slow.' Can you, in 30 seconds, use this dashboard to tell if the problem is the service itself, its database, its upstream dependency, or its downstream consumers?
If yes, the dashboard works.
If no, redesign.
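The on-call test can even be made mechanical: a dashboard passes only if its panels cover all four places the problem could live. A toy sketch, with the domain names taken from the scenario above:

```python
# The four failure domains the 3 AM question has to distinguish.
REQUIRED_DOMAINS = {"service", "database", "upstream", "downstream"}

def passes_oncall_test(panel_domains):
    """True if the dashboard's panels cover every required domain.
    panel_domains is the set of domains its panels are tagged with."""
    return REQUIRED_DOMAINS <= set(panel_domains)

print(passes_oncall_test({"service", "database", "upstream", "downstream"}))  # -> True
print(passes_oncall_test({"service", "database"}))  # -> False
```

Tagging panels by failure domain is the assumption doing the work here; once panels carry that metadata, the check is a subset test.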
What to cut
- Graphs with no baseline (flat line or spiky forever; how do you know if it's bad?)
- Metrics you've never used in an actual incident
- Vanity metrics (total requests ever)
- Graphs where the y-axis is in units nobody understands
The hidden metric
The real measure of a dashboard's value: does the on-call engineer open it before or after the paging tool?
If they open it first, it's their compass.
If they open it only after being paged, it's a reference, not a dashboard.
Aim for the first.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com