Grafana Dashboards for Non-Prod Environment Observability: Cost + Performance in One View

#grafana #observability

Non-production environments are the most expensive things nobody talks about. Flexera's 2023 State of the Cloud report puts non-prod at 30–40% of total cloud spend for mature engineering organizations. Yet most teams have zero observability into those environments: no cost tracking, no idle detection, no accountability.

The fix isn't another tool. You already have Prometheus and Grafana. The missing piece is wiring cost data into the same dashboard your engineers check every morning.

This article shows how to build a single Grafana dashboard that surfaces both environment health (CPU, memory, pod status) and real-time dollar cost, per namespace, per environment, per hour. When an engineer opens the staging dashboard and sees "this cluster cost $89 today and is 4% utilized," they act.

The Split-Screen Problem

Right now, understanding non-prod environments requires opening at least three things: the cloud billing console for cost, Grafana for performance, and kubectl or Lens for pod status. These tools don't talk to each other. Cost data in AWS Cost Explorer is delayed by 24–48 hours. Grafana shows real-time utilization but has no dollar context. Non-prod namespaces rarely appear in either.

The result is that nobody knows what non-prod costs in real time. Idle clusters run overnight, on weekends, and during holidays. Not because engineers don't care, but because the signal that would prompt action doesn't exist.

The architecture we're building collapses these three into one Grafana dashboard backed by Prometheus.

What You Need in Prometheus First

The cost dashboard runs on two data sources that you likely already have: kube-state-metrics and cAdvisor (bundled with kubelet). kube-state-metrics exposes over 100 Kubernetes object metrics: pod phase, container resource requests and limits, node allocatable capacity. cAdvisor provides real-time CPU and memory consumption.

Neither exposes dollar cost directly. That requires a recording rule that computes hourly cost per namespace from first principles.

The key metrics you need (all from kube-state-metrics except `container_cpu_usage_seconds_total`, which comes from cAdvisor):

Metric	What It Measures	Used For
`kube_node_status_allocatable`	CPU/memory available on each node	Cost denominator
`kube_pod_container_resource_requests`	CPU/memory requested per container	Cost allocation
`kube_pod_info`	Pod-to-namespace mapping	Aggregation by team
`kube_node_labels`	Node labels (instance type, zone)	Price lookup
`container_cpu_usage_seconds_total`	Actual CPU consumption	Utilization %

Recording rules are non-negotiable. Raw PromQL cost calculations across all pods and nodes are slow, taking 2–4 seconds on every dashboard refresh. A recording rule pre-computes hourly cost every 5 minutes, so the dashboard reads a stored series and query time drops to under 50 milliseconds.

Building the Cost Allocation Formula

The cost allocation model we use is derived from OpenCost's specification, adapted for on-demand pricing. The core idea: a container's cost equals the fraction of node resources it requests, multiplied by the node's hourly price.

For CPU:

container_hourly_cost_cpu = (cpu_request / node_allocatable_cpu) × node_hourly_price

For memory, the same formula applies with memory units. The total container cost takes whichever resource is the binding constraint: the max of CPU and memory allocation fractions. This prevents double-counting on multi-tenant nodes.
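As a minimal sketch of the formula above, the Python below computes one container's hourly cost from its requests. The function name, the node size (2 cores and 8 GiB, the nominal size of a t3.large rather than its allocatable capacity) and the $0.0832 hourly price are illustrative assumptions; substitute your own allocatable values and effective rate.

```python
def container_hourly_cost(cpu_request, mem_request_gib,
                          node_cpu, node_mem_gib, node_hourly_price):
    """Hourly cost of one container: binding resource fraction x node price."""
    cpu_fraction = cpu_request / node_cpu
    mem_fraction = mem_request_gib / node_mem_gib
    # Charge for whichever resource takes the larger share of the node,
    # rather than adding the two shares together.
    return max(cpu_fraction, mem_fraction) * node_hourly_price


# Illustrative numbers only: 0.5 cores and 1 GiB on a 2-core, 8 GiB node.
cost = container_hourly_cost(0.5, 1.0, node_cpu=2.0, node_mem_gib=8.0,
                             node_hourly_price=0.0832)
print(f"${cost:.4f} per hour")
```

In this example the CPU share (25%) is larger than the memory share (12.5%), so CPU is the binding constraint and the container is charged a quarter of the node's price.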

For an 8-node staging cluster running t3.large instances on AWS (us-east-1), the math looks like this:

Namespace	CPU Requested	Memory Requested	Hourly Cost	Daily Cost
staging-api	3.2 cores	8 Gi	$2.14	$51.36
staging-db	2.0 cores	16 Gi	$1.87	$44.88
staging-workers	1.5 cores	4 Gi	$0.95	$22.80
staging-infra	0.8 cores	2 Gi	$0.48	$11.52
Total	7.5 cores	30 Gi	$5.44	$130.56

The Prometheus recording rule that produces cost_per_namespace_hourly aggregates these calculations across all containers in a namespace and stores the result as a time series. This time series is what powers every cost panel in the dashboard.
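One way such a rule could look is sketched below. The Python wrapper only exists to keep the price and interval in one place and to print a rule file; the PromQL inside is the substance. It assumes the kube-state-metrics v2 label layout (`resource`, `node`), a single node type priced at a placeholder $0.0832 per hour, and a hypothetical group name `nonprod_cost`. Treat it as a starting point to validate against your own metrics, not a definitive rule.

```python
import textwrap

# Placeholder price per node-hour; replace with your effective rate.
NODE_HOURLY_PRICE = 0.0832

# Fraction of its node that each container requests, per resource. The max
# keeps the binding resource (CPU or memory) for each container.
BINDING_FRACTION = """max by (namespace, pod, container, node) (
  kube_pod_container_resource_requests{resource=~"cpu|memory"}
  / on (node, resource) group_left ()
  kube_node_status_allocatable{resource=~"cpu|memory"}
)"""


def render_rule_file(price=NODE_HOURLY_PRICE, interval="5m"):
    """Return a Prometheus rule file that records cost_per_namespace_hourly."""
    expr = (
        "sum by (namespace) (\n"
        + textwrap.indent(BINDING_FRACTION, "  ")
        + f"\n) * {price}"
    )
    return (
        "groups:\n"
        "  - name: nonprod_cost\n"
        f"    interval: {interval}\n"
        "    rules:\n"
        "      - record: cost_per_namespace_hourly\n"
        "        expr: |\n"
        + textwrap.indent(expr, " " * 10)
        + "\n"
    )


print(render_rule_file())
```

With mixed node types, the constant price would have to become a per-node series joined on the `node` label instead of a single multiplier.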

One practical note on pricing: hardcode your node's on-demand hourly price as a Prometheus label or a Grafana variable. If you use reserved instances or savings plans, use the effective hourly rate instead of on-demand. Otherwise you'll overstate cost and confuse engineers.

Designing the Dashboard Layout

The dashboard has three rows. The top row is stat panels: the numbers engineers see in 3 seconds. The middle row is time series: trends over the past 24 hours. The bottom row is a table: the full namespace breakdown with sortable columns.

Panel specifications:

Panel	Type	Data Source	Purpose
Today's Total Cost	Stat	`sum(cost_per_namespace_hourly) * 24`	Instant cost awareness (current hourly rate projected over 24 hours)
Average CPU Utilization	Stat	`avg(container_cpu_usage / cpu_request)`	Idle health signal
Cost by Namespace (24h)	Time series	`cost_per_namespace_hourly`	Spot when cost spiked
CPU Utilization by Namespace	Time series	`container_cpu_usage / cpu_request`	Correlate with cost
Namespace Cost Breakdown	Table	joined query	Full accountability view
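The utilization expressions in the table are shorthand. Spelled out against the cAdvisor and kube-state-metrics series named earlier, they might look like the queries below; the dictionary is just a convenient way to list them. The 5-minute rate window and the `container!=""` filter (which drops pod-level aggregate series) are assumptions to adjust for your setup.

```python
# Hypothetical PromQL for the panels above, keyed by panel title.
PANEL_QUERIES = {
    "Today's Total Cost": "sum(cost_per_namespace_hourly) * 24",
    # Cluster-wide ratio of CPU actually used to CPU requested.
    "Average CPU Utilization": (
        'sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))'
        ' / sum(kube_pod_container_resource_requests{resource="cpu"})'
    ),
    "Cost by Namespace (24h)": "cost_per_namespace_hourly",
    # Same ratio, broken out per namespace.
    "CPU Utilization by Namespace": (
        "sum by (namespace) "
        '(rate(container_cpu_usage_seconds_total{container!=""}[5m]))'
        " / sum by (namespace) "
        '(kube_pod_container_resource_requests{resource="cpu"})'
    ),
}

for title, expr in PANEL_QUERIES.items():
    print(f"{title}:\n  {expr}\n")
```

Note that the stat panel query divides total usage by total requests, which weights large namespaces more heavily than a plain average of per-namespace ratios would.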

Set the dashboard's default time range to Last 24 hours with a 15-minute auto-refresh. The cost series only change every 5 minutes (the recording rule interval), so refreshing any faster than that returns no new cost data and just wastes browser cycles.

Use Grafana's threshold feature on the stat panels. For an idle percentage panel (100% minus average CPU utilization), color it green below 30%, yellow from 30–70%, and red above 70%. Red means the environment is mostly idle, a visceral signal that doesn't need explanation.
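The thresholds can be sketched as data. The dictionary below follows the layout Grafana uses for thresholds in exported panel JSON as far as I know it (check it against an export from your own Grafana version), and `color_for` is a hypothetical helper that mimics how a value picks up the color of the highest step it reaches.

```python
import json

# Threshold steps for the idle percentage panel. The first step has no
# lower bound, so it acts as the base color.
IDLE_THRESHOLDS = {
    "mode": "absolute",
    "steps": [
        {"color": "green", "value": None},
        {"color": "yellow", "value": 30},
        {"color": "red", "value": 70},
    ],
}


def color_for(idle_pct, thresholds=IDLE_THRESHOLDS):
    """Return the color of the highest step whose value is <= idle_pct."""
    chosen = thresholds["steps"][0]["color"]
    for step in thresholds["steps"][1:]:
        if idle_pct >= step["value"]:
            chosen = step["color"]
    return chosen


print(json.dumps(IDLE_THRESHOLDS, indent=2))
print(color_for(96))  # a cluster that is 4% utilized is 96% idle
```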

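For the namespace table, add a column that shows hours since last active using `(time() - last_active_timestamp) / 3600`; `time()` returns seconds, so the difference has to be divided by 3600. Here `last_active_timestamp` is not a built-in metric: it stands for a series you record yourself, holding the Unix time of each namespace's last activity. If a namespace has had zero HTTP traffic and under 5% CPU for more than 2 hours, flag it as idle. Engineers can sort by this column and immediately see what's safe to shut down.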
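To make the column concrete, here is a small sketch of the same calculation outside Grafana. The function names and the timestamps are hypothetical; the point is the seconds-to-hours conversion and the longest-idle-first ordering.

```python
import time


def hours_since_active(last_active_timestamp, now=None):
    """Hours elapsed since the last recorded activity (never negative)."""
    now = time.time() if now is None else now
    return max(0.0, (now - last_active_timestamp) / 3600)


def idle_table(last_active_by_namespace, now=None):
    """Rows of (namespace, hours idle), longest-idle first."""
    rows = [(ns, hours_since_active(ts, now))
            for ns, ts in last_active_by_namespace.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical timestamps, expressed relative to a fixed "now" for clarity.
now = 1_700_000_000
last_active = {
    "staging-api": now - 3 * 3600,
    "staging-db": now - 600,
    "staging-workers": now - 9 * 3600,
}
for namespace, hours in idle_table(last_active, now):
    print(f"{namespace:<16} {hours:5.1f} h")
```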

The Idle Detection Row

Add a fourth row: idle environment detection. This is the dashboard's most actionable panel.

Idle detection requires two signals, not one. Low CPU utilization alone is ambiguous: a batch job looks idle between runs. Pair it with inbound request rate. An environment is idle when both CPU utilization is under 10% and request rate is under 0.1 requests per second for a sustained window of 2 hours.
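The two-signal rule can be sketched in a few lines. The `Sample` type and `is_idle` function are hypothetical names, and the code assumes you already have per-namespace samples of CPU utilization and request rate; in an alert rule, the sustained window would normally be expressed with the rule's `for` duration instead.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float        # Unix seconds
    cpu_utilization: float  # CPU usage / CPU request, as a fraction
    request_rate: float     # inbound requests per second


def is_idle(samples, now, window_s=2 * 3600, cpu_max=0.10, rps_max=0.1):
    """True when both signals stay under their thresholds for the whole window."""
    window_start = now - window_s
    # Without history reaching back to the start of the window, a namespace
    # that only just appeared would be flagged too early.
    if not samples or min(s.timestamp for s in samples) > window_start:
        return False
    recent = [s for s in samples if window_start <= s.timestamp <= now]
    return bool(recent) and all(
        s.cpu_utilization < cpu_max and s.request_rate < rps_max
        for s in recent
    )


# Three hours of quiet samples at 5-minute spacing (made-up data).
quiet = [Sample(t, 0.04, 0.0) for t in range(0, 3 * 3600 + 1, 300)]
print(is_idle(quiet, now=3 * 3600))
```

A single burst of traffic inside the window is enough to reset the verdict, which is what keeps a low-CPU but actively used environment from being flagged.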

Display this as a state timeline panel: one row per namespace, green for active, red for idle. This gives immediate visual context across all environments simultaneously. An SRE opening the dashboard sees in 5 seconds which namespaces have been red all morning.

Wire a Grafana alert to this panel. When a namespace stays idle for 2 hours, send a Slack message: "staging-api has been idle for 2h 0m. Cost while idle: $4.28. Shut it down?" Include a one-click link to a runbook or automation trigger. That message converts observability into action.
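If the message is assembled by a small script or webhook handler rather than an alert template, it could be built roughly like this. The function names and the runbook URL are placeholders, and "cost while idle" is interpreted here as hourly cost multiplied by idle time, which is an assumption about what the message should report. The JSON body uses the `text` field accepted by Slack incoming webhooks; actually posting it is left out.

```python
import json


def format_duration(seconds):
    """Render a duration as e.g. '2h 15m'."""
    hours, remainder = divmod(int(seconds), 3600)
    return f"{hours}h {remainder // 60}m"


def idle_alert_payload(namespace, idle_seconds, hourly_cost, runbook_url):
    """Build the JSON body for a Slack incoming webhook."""
    cost_while_idle = hourly_cost * idle_seconds / 3600
    text = (
        f"{namespace} has been idle for {format_duration(idle_seconds)}. "
        f"Cost while idle: ${cost_while_idle:.2f}. "
        f"Shut it down? {runbook_url}"
    )
    return json.dumps({"text": text})


# staging-api at the $2.14 hourly rate from the allocation table.
print(idle_alert_payload("staging-api", 2 * 3600, 2.14,
                         "https://example.com/runbooks/idle-teardown"))
```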

From Dashboard to Action

The dashboard does one thing: make the cost of inaction visible. What teams do with that visibility falls into two patterns.

Manual response: An engineer sees the idle flag, runs a teardown script, and restarts the environment before the next test run. This works for small teams with high ownership. It fails at scale: someone is always on leave, and someone always forgets.

Automated scheduling: Tools like ZopNight use the same signals (idle threshold + schedule) to automatically suspend non-prod environments overnight and restart them before business hours. The Grafana dashboard becomes the proof layer: engineers see the cost-before and cost-after, which makes the automation's value concrete rather than theoretical.

Approach	Avg. Daily Savings	Engineer Effort	Reliability
No action (baseline)	$0	0h	N/A
Manual teardown	$62	45 min/day	Low, depends on someone remembering
Automated scheduling	$89	15 min/week	High, runs without intervention

The numbers come from the same cluster we modeled earlier: an 8-node t3.large staging cluster running 24/7. Automated scheduling eliminates 12 hours of idle runtime per day (8pm–8am) and captures weekend spend entirely. The dashboard makes that $89/day saving visible and attributable, which is what gets the automation prioritized and kept running.

Build the dashboard first. The Prometheus data is already there. The recording rules take 20 minutes to write. The Grafana panels take an afternoon. Once engineers see live cost next to live utilization, the conversation about what to automate changes. You stop arguing about whether non-prod costs matter and start deciding what to do about it.