I've built dozens of dashboards. Most have been ignored. A few have been used constantly. The difference isn't the graphs. It's the design.
The 3-second test
A useful dashboard answers 'is everything OK?' in 3 seconds. Not 'let me scroll through 40 graphs to find out.'
Big colored header at the top: green = healthy, yellow = watching, red = broken. That's the 3-second answer. Everything else is drill-down.
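The traffic-light rollup above is simple enough to sketch in a few lines. This is a minimal illustration, not any particular tool's API; the service names and thresholds are made up.

```python
# Severity order for the traffic-light rollup: green < yellow < red.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def classify(error_rate, warn=0.01, crit=0.05):
    """Map a single SLI (here, error rate) to a color.
    Thresholds are illustrative placeholders."""
    if error_rate >= crit:
        return "red"
    if error_rate >= warn:
        return "yellow"
    return "green"

def overall_status(service_statuses):
    """The 3-second answer: the worst status across all services."""
    return max(service_statuses.values(), key=lambda s: SEVERITY[s])

# Hypothetical services feeding the header.
statuses = {
    "api": classify(0.002),       # green
    "db": classify(0.02),         # yellow
    "payments": classify(0.001),  # green
}
print(overall_status(statuses))  # -> yellow
```

Worst-of is the right aggregation here: one yellow service should turn the whole header yellow, because the header's only job is to answer "is everything OK?"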
The hierarchy rule
Three layers, no more:
- Overview: one line per service, status color, key SLI
- Service detail: one dashboard per service, 6-12 graphs max
- Deep dive: triggered from service detail, domain-specific
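The three layers can be modeled as plain data, each level linking down to the next. A hedged sketch with hypothetical service and dashboard names, including a check for the 6-12 graph budget:

```python
# Layer 1: overview — one line per service, linking to its detail dashboard.
overview = {
    "api":      {"status": "green",  "sli": "p99 latency", "drilldown": "dash/api"},
    "payments": {"status": "yellow", "sli": "error rate",  "drilldown": "dash/payments"},
}

# Layer 2: service detail — a handful of graphs plus links to deep dives.
service_detail = {
    "dash/payments": {
        "graphs": ["error rate", "p99 latency", "saturation",
                   "db latency", "queue depth", "deploy markers"],
        "deep_dives": ["dash/payments/db", "dash/payments/queue"],  # layer 3
    },
}

def graph_budget_ok(dashboard):
    """Enforce the 6-12 graph budget on a service-detail dashboard."""
    return 6 <= len(dashboard["graphs"]) <= 12

print(graph_budget_ok(service_detail["dash/payments"]))  # -> True
```

The point of making the hierarchy explicit is that every dashboard knows exactly one parent and its drill-down targets, so there is nowhere to get lost.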
Anything beyond 3 layers is 'please get lost in my dashboard tree.'
The on-call test
Imagine you're on-call at 3 AM. You get paged for 'service X is slow.' Can you, in 30 seconds, use this dashboard to tell if the problem is the service itself, its database, its upstream dependency, or its downstream consumers?
If yes, the dashboard works.
If no, redesign.
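The on-call test can even be made mechanical: a dashboard passes only if its panels cover all four places the problem could live. A toy sketch, with the domain names taken from the scenario above:

```python
# The four failure domains the 3 AM question has to distinguish.
REQUIRED_DOMAINS = {"service", "database", "upstream", "downstream"}

def passes_oncall_test(panel_domains):
    """True if the dashboard's panels cover every required domain.
    panel_domains is the set of domains its panels are tagged with."""
    return REQUIRED_DOMAINS <= set(panel_domains)

print(passes_oncall_test({"service", "database", "upstream", "downstream"}))  # -> True
print(passes_oncall_test({"service", "database"}))  # -> False
```

Tagging panels by failure domain is the assumption doing the work here; once panels carry that metadata, the check is a subset test.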
What to cut
- Graphs with no baseline (flat line or spiky forever; how do you know if it's bad?)
- Metrics you've never used in an actual incident
- Vanity metrics (total requests ever)
- Graphs where the y-axis is in units nobody understands
The hidden metric
The real measure of a dashboard's value: does the on-call engineer open it before or after the paging tool?
If they open it first, it's their compass.
If they open it only after being paged, it's a reference, not a dashboard.
Aim for the first.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com