Guatu

Posted on Jun 8 • Originally published at guatulabs.dev

Grafana Dashboards: Information Density vs Readability


I spent three hours staring at a "Global Infrastructure" dashboard that took 12 seconds to load, only to realize I couldn't actually tell if my GPU nodes were throttling. I had roughly 40 panels on a single page, ranging from CPU steal percentages to disk IOPS and temperature sensors. It looked like a NASA control room, but it functioned like a legacy database query from 1998.

If you're managing a multi-node cluster or a complex AI pipeline, the temptation is to put every single metric you can possibly scrape into one view. The logic is: "If it's on the screen, I can't miss it." In reality, when everything is highlighted, nothing is. You end up with a dashboard that is visually noisy and computationally expensive.

The Performance Wall

Most people treat Grafana like a static webpage, but every panel is a live query, and a panel can carry more than one. If you have 40 panels, you're hitting your Prometheus or VictoriaMetrics instance with at least 40 separate queries every time you refresh the page or change the time range.

Those queries don't all run at once. Between the browser's connection limits and the data source's own query concurrency limits, they end up queued. When you hit a certain density, you start seeing the "loading" spinners stagger. You'll see the top row pop in, then a three-second gap, then the middle row. This isn't just an annoyance. It's a signal that your dashboard design is fighting the underlying data source.

I've seen this happen most often when people deploy a "thorough" community dashboard from a JSON export without pruning it. You get a beautiful layout, but it's querying metrics you don't even have exporters for, leading to a sea of "No Data" panels that still cost query time.

Information Density vs. Cognitive Load

There is a difference between a "dense" dashboard and a "cluttered" one.

A dense dashboard uses a high ratio of data to pixels. It uses small, efficient visualizations (like Stat panels or Gauges) to show current state, and reserves large Time Series panels for trends.

A cluttered dashboard is just a collection of every graph the engineer thought was "interesting" at the time of creation.

The goal is to reduce the time between looking at the screen and understanding the state of the system. If I have to squint to see if a line is crossing a threshold because there are six other lines in the same color palette, the dashboard has failed.

The Solution: Hierarchical Monitoring

Instead of one "God Dashboard," I moved to a three-tier hierarchy. This separates the "Is it broken?" view from the "Why is it broken?" view.

Tier 1: The Heartbeat (High Density, Low Detail)

This is a single screen. No time series graphs. Only Stat panels and Gauges.

Goal: Binary state. Green = OK, Red = Action Required.
Metrics: Cluster-wide CPU/RAM usage, number of Pending pods, GPU temperature peaks, and API latency.
Behavior: I keep this on a wall monitor. I don't want to see the "wiggle" of a graph; I want to see a red box if a node disappears.
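
As a rough illustration of the "red box" idea, here is a minimal sketch of a single Tier 1 Stat panel, packaged in the same ConfigMap style used for dashboards later in this post. The ConfigMap name, the file name, and the threshold of one Pending pod are assumptions for illustration, and the query assumes kube-state-metrics is being scraped (it provides kube_pod_status_phase).

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: heartbeat-dashboard        # hypothetical name
  labels:
    grafana_dashboard: "1"
data:
  heartbeat.json: |
    {
      "id": null,
      "title": "Heartbeat - Tier 1",
      "panels": [
        {
          "type": "stat",
          "title": "Pending Pods",
          "datasource": "Prometheus",
          "gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
          "targets": [
            { "expr": "sum(kube_pod_status_phase{phase=\"Pending\"})" }
          ],
          "options": { "colorMode": "background", "graphMode": "none" },
          "fieldConfig": {
            "defaults": {
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  { "color": "green", "value": null },
                  { "color": "red", "value": 1 }
                ]
              }
            },
            "overrides": []
          }
        }
      ]
    }
```

Setting graphMode to "none" drops the sparkline, so the panel shows only a number on a green or red background, with no "wiggle" to interpret.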

Tier 2: The Service View (Medium Density)

This is where I use variables to filter by namespace or node.

Goal: Identify the specific component failing.
Metrics: Per-pod memory usage, network throughput, and request rates.
Behavior: I use Grafana variables ($node, $namespace) so that one dashboard template serves 20 different services.

Tier 3: The Deep Dive (Low Density, High Detail)

These are specialized dashboards for specific hardware or software.

Goal: Root cause analysis.
Metrics: GPU SM clock speeds, PCIe bus errors, or Longhorn volume replication lag.
Behavior: I only open these when Tier 1 or Tier 2 tells me something is wrong.

Implementing the Architecture

To make this work without manual overhead, I use a combination of Prometheus Operator ServiceMonitors for auto-discovery and ConfigMaps for dashboard versioning.

If you're running GPUs, you shouldn't be manually adding every GPU to a dashboard. Use the nvidia-gpu-exporter and let Prometheus handle the labels.

Here is how I deploy the exporter to ensure the metrics are clean and available for the hierarchical dashboards:

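apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-exporter
spec:
  selector:
    matchLabels:
      app: nvidia-gpu-exporter
  template:
    metadata:
      labels:
        # Must match spec.selector.matchLabels, or the API server rejects the DaemonSet
        app: nvidia-gpu-exporter
    spec:
      # Requires a RuntimeClass named "nvidia" to exist on the cluster
      runtimeClassName: nvidia
      containers:
        - name: nvidia-gpu-exporter
          image: ghcr.io/your-org/nvidia-gpu-exporter:v1.4.1
          ports:
            # Named so that Services and monitors can refer to the port by name
            - name: metrics
              containerPort: 9835
          env:
            # Expose every GPU on the node to the exporter container
            - name: NVIDIA_VISIBLE_DEVICES
              value: all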

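To avoid the "manual update" nightmare, I store my dashboard JSONs in Git and deploy them via ConfigMaps. The grafana_dashboard label is what lets Grafana find them automatically, assuming the dashboard sidecar that ships with the Grafana Helm chart is enabled and configured to watch that label. This allows me to prune unnecessary panels in one place and roll the change out across the entire cluster at once.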

apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-monitoring-dashboard
  labels:
    grafana_dashboard: "1"
data:
  dashboard.json: |
    {
      "id": null,
      "title": "GPU Health - Tier 2",
      "panels": [
        {
          "type": "stat",
          "title": "GPU Memory Usage",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(dcgm_fb_used) by (instance)"
            }
          ]
        },
        {
          "type": "timeseries",
          "title": "GPU Temperature Trend",
          "targets": [
            {
              "expr": "dcgm_temp"
            }
          ]
        }
      ]
    }

And to ensure Prometheus is actually picking up these metrics without me having to hardcode IPs, I use a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-gpu-exporter
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: nvidia-gpu-exporter
  endpoints:
    - port: metrics
      interval: 30s
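
One detail worth spelling out: a ServiceMonitor selects Services, not pods, and port: metrics refers to a named port on that Service. The post doesn't show the Service, so the following is a minimal sketch of what it could look like. It assumes the exporter pods carry the app: nvidia-gpu-exporter label and listen on 9835, and that everything lives in the same namespace as the ServiceMonitor.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nvidia-gpu-exporter
  labels:
    app: nvidia-gpu-exporter     # matched by the ServiceMonitor's spec.selector
spec:
  clusterIP: None                # headless; a design choice here, not a requirement
  selector:
    app: nvidia-gpu-exporter     # assumes the exporter pods carry this label
  ports:
    - name: metrics              # the name that `port: metrics` refers to
      port: 9835
      targetPort: 9835
```

The release: monitoring label on the ServiceMonitor only helps if it matches the serviceMonitorSelector of your Prometheus resource, so check that value against your own install.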

The Gotchas of High-Density Design

Even with a hierarchy, there are a few traps I fell into.

The "Too Many Variables" Trap

I once built a dashboard with six different dropdown variables (Cluster, Namespace, Pod, Container, Disk, GPU). Every time I changed one, Grafana had to re-evaluate every single panel. It felt like the browser was hanging.
The Fix: Limit your top-level variables. Use "chained" variables where the Pod dropdown only shows pods for the selected Namespace.
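
A minimal sketch of chained variables, again as a dashboard ConfigMap. The names are hypothetical, and the queries assume kube-state-metrics is available (kube_pod_info carries namespace and pod labels). The string form of "query" shown here is the older, compact form, and newer Grafana versions may rewrite it on save.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: service-view-dashboard   # hypothetical name
  labels:
    grafana_dashboard: "1"
data:
  service-view.json: |
    {
      "id": null,
      "title": "Service View - Tier 2",
      "templating": {
        "list": [
          {
            "name": "namespace",
            "type": "query",
            "datasource": "Prometheus",
            "query": "label_values(kube_pod_info, namespace)",
            "refresh": 1
          },
          {
            "name": "pod",
            "type": "query",
            "datasource": "Prometheus",
            "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
            "refresh": 2
          }
        ]
      },
      "panels": []
    }
```

Because the pod query references $namespace, Grafana re-runs it whenever the namespace selection changes, and the Pod dropdown only ever lists pods from the selected namespace. A refresh value of 1 reloads the options on dashboard load, and 2 reloads them when the time range changes.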

The Color Palette Problem

When you have 10 lines on one graph, Grafana's default colors start to repeat or become indistinguishable.
The Fix: Use "Overrides" in the panel settings. Explicitly map a specific metric (e.g., node_cpu_seconds_total{mode="iowait"}) to a specific color like bright orange. This removes the cognitive load of checking the legend every five seconds.
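
In dashboard JSON, a fixed color like this is expressed as a field override on the panel. The following is a sketch under a few assumptions: the ConfigMap and file names are made up, node_exporter metrics are available, and the legend is set to the mode label so the series is literally named "iowait" for the byName matcher to find.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-cpu-dashboard       # hypothetical name
  labels:
    grafana_dashboard: "1"
data:
  node-cpu.json: |
    {
      "id": null,
      "title": "Node CPU - Tier 3",
      "panels": [
        {
          "type": "timeseries",
          "title": "CPU Time by Mode",
          "datasource": "Prometheus",
          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
          "targets": [
            {
              "expr": "sum by (mode) (rate(node_cpu_seconds_total[5m]))",
              "legendFormat": "{{mode}}"
            }
          ],
          "fieldConfig": {
            "defaults": {},
            "overrides": [
              {
                "matcher": { "id": "byName", "options": "iowait" },
                "properties": [
                  { "id": "color", "value": { "mode": "fixed", "fixedColor": "orange" } }
                ]
              }
            ]
          }
        }
      ]
    }
```

The matcher works on the displayed series name, so if you change legendFormat the override has to change with it.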

The Refresh Rate Death Spiral

Setting a dashboard to "Auto-refresh: 5s" with 30 panels means at least six queries per second from every browser tab that has it open, which is a great way to DoS your own Prometheus instance.
The Fix: Tier 1 (Heartbeat) can refresh every 10-15 seconds. Tier 3 (Deep Dive) should be manual. There is no reason to auto-refresh a detailed GPU memory leak analysis every few seconds.
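
If you'd rather enforce a floor than rely on everyone picking sensible intervals, Grafana has a min_refresh_interval setting in the [dashboards] section of its configuration. Here is a sketch as a Helm values fragment, assuming Grafana is installed from the Grafana Helm chart. With kube-prometheus-stack the same keys would sit under a top-level grafana: key. The 10s value is only an example.

```yaml
# values.yaml fragment (assumes the Grafana Helm chart; adjust nesting for your install)
grafana.ini:
  dashboards:
    # Dashboards cannot auto-refresh faster than this,
    # regardless of what their JSON or the user asks for
    min_refresh_interval: 10s
```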

Lessons Learned

The most important thing I learned is that a dashboard is a tool for decision-making, not a data dump.

If you can't look at a dashboard for 5 seconds and tell me exactly what is wrong, it's too cluttered. I've spent too much time building "cool" dashboards that were useless in a 3 AM outage because I had to hunt through 15 panels to find the one metric that actually mattered.

I've applied this same philosophy to my other infrastructure. For example, when dealing with Longhorn volume health, I stopped trying to track every single replica's sync state on one page. Instead, I created a "Health Score" (a single Stat panel) that only turns red when the aggregate health of the volume drops below 100%.
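
The post doesn't say how that score is computed, so the following is only one possible sketch: a recording rule for the fraction of Longhorn volumes that report as healthy. It assumes Longhorn's longhorn_volume_robustness metric is being scraped (where 1 means healthy) and that the Prometheus Operator picks up PrometheusRule objects labeled release: monitoring. The resource and rule names are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: longhorn-health-score    # hypothetical name
  labels:
    release: monitoring
spec:
  groups:
    - name: longhorn-health
      rules:
        # 1.0 when every volume is healthy, lower as volumes degrade or fault
        - record: longhorn:volume_health:ratio   # hypothetical rule name
          expr: |
            (count(longhorn_volume_robustness == 1) or vector(0))
              /
            count(longhorn_volume_robustness)
```

A Stat panel on this series with a red threshold for anything below 1 gives the single number described above. The or vector(0) keeps the ratio at 0 instead of "No Data" when no volume is healthy.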

If you're building out your own monitoring, start with the "Heartbeat" view. Ask yourself: "What is the one number that tells me I need to wake up?" Build that first. Everything else is just a deep dive for when things actually break.

For those managing high-performance AI workloads, this becomes even more critical. Monitoring GPU power states and memory fragmentation requires a different level of granularity than monitoring a web server. If you're struggling to balance the noise of bare-metal Kubernetes with the need for precision, I've dealt with these exact trade-offs in my infrastructure consulting.

Stop adding panels. Start deleting them.
