David

Posted on Dec 12, 2025 • Originally published at azure-noob.com

Building Azure Dashboards for Cloud NOC Teams (What Actually Gets Used vs What Gets Ignored)

#azure #monitoring #dashboards #operations

The Dashboard Problem

What we built: 47 tiles showing metrics, logs, alerts, compliance, costs

What NOC uses: 3 tiles

Why: The dashboard answered "What's our resource count?" when the NOC was asking "What's broken?"

What NOC Teams Actually Need

Question #1: "What's Down Right Now?"

Not: 15 charts showing healthy services

Yes: List of failures, ranked by business impact

Question #2: "What Needs My Attention?"

Not: 200 active alerts

Yes: 5 critical alerts requiring human action
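
As a rough sketch of how that shortlist might be produced, the Azure Resource Graph query below keeps only fired, unacknowledged Sev0/Sev1 alerts. It assumes alerts are visible through the `alertsmanagementresources` table and that "requiring human action" maps to `alertState == "New"`; both the severity cut-off and the limit of 5 are illustrative choices, not something the original dashboard prescribes.

```kql
// Azure Resource Graph query (Resource Graph Explorer), not Log Analytics.
// Assumption: "needs attention" = fired, not yet acknowledged, Sev0 or Sev1.
alertsmanagementresources
| where type =~ "microsoft.alertsmanagement/alerts"
| extend e = properties.essentials
| where tostring(e.monitorCondition) == "Fired"
| where tostring(e.alertState) == "New"
| where tostring(e.severity) in ("Sev0", "Sev1")
| project AlertName = name,
          Severity = tostring(e.severity),
          Target = tostring(e.targetResourceName),
          FiredAt = todatetime(e.startDateTime)
| order by Severity asc, FiredAt desc
| take 5
```

Acknowledging an alert moves it out of the `New` state, so it drops off the tile once someone owns it.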

Question #3: "Is This Normal?"

Not: Current CPU usage

Yes: Current vs 7-day baseline with "normal range" shading
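
One possible way to get the "normal range" band is to build it per hour of day from the previous seven days and plot the last 24 hours against it. This is a sketch under assumptions: the `Perf` table is populated with the standard `Processor` counters, `web-01` is a hypothetical computer name, and the 10th to 90th percentile band is an arbitrary definition of "normal".

```kql
// Hypothetical target; replace with a real Computer value from your workspace.
let target = "web-01";
let cpu = Perf
    | where TimeGenerated > ago(8d)
    | where Computer == target
    | where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
    | project TimeGenerated, CounterValue, HourOfDay = hourofday(TimeGenerated);
// Baseline band: days 2-8 back, grouped by hour of day.
let baseline = cpu
    | where TimeGenerated < ago(1d)
    | summarize NormalLow = percentile(CounterValue, 10),
                NormalHigh = percentile(CounterValue, 90)
              by HourOfDay;
// Last 24 hours, one point per hour, joined to the band for that hour of day.
cpu
| where TimeGenerated >= ago(1d)
| summarize Current = avg(CounterValue) by Hour = bin(TimeGenerated, 1h), HourOfDay
| join kind=inner baseline on HourOfDay
| project Hour, Current, NormalLow, NormalHigh
| order by Hour asc
| render timechart
```

Grouping by hour of day keeps a nightly batch job from looking abnormal every night.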

Dashboard Design That Works

Tile 1: Critical Incidents (Top Priority)

Query:

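AzureActivity
// Filter on time first so only the last hour is scanned
| where TimeGenerated > ago(1h)
| where Level in ("Critical", "Error")
// FirstSeen feeds the "time since first occurrence" field
| summarize Count = count(), FirstSeen = min(TimeGenerated) by ResourceGroup, OperationNameValue
| order by Count desc
| take 10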

Display:

Red alert icon
Resource name
Error count
Time since first occurrence
Business impact (if known)

Tile 2: Service Health Issues

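Query (Azure Resource Graph, not Log Analytics):

ServiceHealthResources
| where type =~ "microsoft.resourcehealth/events"
// Property names under "properties" are case-sensitive; the PascalCase names
// here follow Microsoft's Resource Graph samples. Verify them in your tenant.
| where tostring(properties.Status) == "Active"
| mv-expand Impact = properties.Impact
| project ServiceName = tostring(Impact.ImpactedService),
          Issue = tostring(properties.Title),
          EventType = tostring(properties.EventType),
          Regions = Impact.ImpactedRegions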

Display:

Azure service name
Issue description
Affected regions
Link to status page

Tile 3: Failed Deployments

Query:

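AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue =~ "Microsoft.Resources/deployments/write"
// Workspaces typically record "Failure" here; "Failed" is kept in case older records use it
| where ActivityStatusValue in ("Failure", "Failed")
| project TimeGenerated, Caller, ResourceGroup, ErrorMessage = Properties
| order by TimeGenerated desc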

Display:

Who tried to deploy
What failed
Error message
Time
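
The raw `Properties` column is a JSON string, which is hard to read in a tile. A minimal sketch of pulling out a shorter message follows; it assumes the JSON carries `statusCode` and `statusMessage` keys for failed operations, which is worth checking against a few real rows before relying on it.

```kql
AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue =~ "Microsoft.Resources/deployments/write"
| where ActivityStatusValue in ("Failure", "Failed")
// Properties is stored as a string, so parse it before reading keys.
| extend Props = parse_json(Properties)
// Assumption: these keys are present on failed deployment records.
| extend StatusCode = tostring(Props.statusCode),
         StatusMessage = tostring(Props.statusMessage)
| project TimeGenerated, Caller, ResourceGroup, _ResourceId, StatusCode, StatusMessage
| order by TimeGenerated desc
```

If the keys are missing, the two extracted columns come back empty rather than failing the query.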

Tile 4: Abnormal Resource Consumption

Query:

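Perf
| where TimeGenerated > ago(1h)
// Restrict to the whole-machine CPU counter, not per-core or per-process instances
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize AvgCPU = avg(CounterValue) by Computer
| where AvgCPU > 85
| order by AvgCPU desc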

Display:

VM name
Current CPU %
Comparison to 7-day average
Threshold breach time
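
The display fields above include a 7-day comparison and a breach time. One way those columns could be produced in a single query is sketched below, assuming the same `Perf` counters and the 85% threshold; `DeltaVsBaseline` is a name introduced here for illustration.

```kql
let cpu = Perf
    | where TimeGenerated > ago(7d)
    | where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total";
// 7-day average per VM, used as the baseline.
let weekly = cpu
    | summarize Avg7d = avg(CounterValue) by Computer;
cpu
| where TimeGenerated > ago(1h)
// FirstBreach = earliest sample in the last hour that exceeded the threshold.
| summarize CurrentCPU = avg(CounterValue),
            FirstBreach = minif(TimeGenerated, CounterValue > 85)
          by Computer
| where CurrentCPU > 85
| join kind=leftouter weekly on Computer
| project Computer,
          CurrentCPU = round(CurrentCPU, 1),
          Avg7d = round(Avg7d, 1),
          DeltaVsBaseline = round(CurrentCPU - Avg7d, 1),
          FirstBreach
| order by DeltaVsBaseline desc
```

Sorting by the delta rather than the raw percentage puts a VM that jumped from 20% to 90% above one that always runs at 88%.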

Tile 5: Budget Alerts

Query:

AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue contains "Microsoft.Consumption"
| where Level in ("Warning", "Error")
| project TimeGenerated, SubscriptionId, OperationNameValue, Level

Display:

Subscription name
Current spend
Budget amount
Forecast end-of-month

What NOT to Include

❌ Resource Counts

Why NOC doesn't care: "We have 847 VMs" doesn't help incident response

Who cares: Capacity planning team

Where it belongs: Monthly capacity review, not NOC dashboard

❌ Compliance Metrics

Why NOC doesn't care: "72% compliant with tag policy" isn't urgent at 2 AM

Who cares: Governance team

Where it belongs: Weekly governance report

❌ Cost Breakdown Charts

Why NOC doesn't care: "Compute is 45% of spend" doesn't help fix outages

Who cares: FinOps team

Where it belongs: Monthly cost review

❌ "Healthy" Status

Why NOC doesn't care: If it's working, they don't need to see it

Better: Only show failures. If dashboard is empty, everything's fine.

Real NOC Dashboard Example

Our 5-tile dashboard:

1. Critical Alerts (red box, top-left)
   - Currently: 0
   - If >0: Shows alert details
2. Service Health (orange box, top-right)
   - Currently: 1 (Azure DevOps degraded, East US)
   - Impact: Low
3. Failed Deployments (yellow box, middle-left)
   - Last 24h: 3 failures
   - Links to logs
4. High CPU VMs (yellow box, middle-right)
   - Currently: 2 VMs over 85%
   - Shows VM names, current %
5. Budget Status (green box, bottom)
   - 67% of monthly budget used
   - 45% of month elapsed
   - Forecast: On track
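
For the budget tile, it helps to be explicit about where the forecast comes from. As a worked example using the two percentages above, a naive straight-line projection (an assumption here, not necessarily how the tile's forecast was produced) gives a very different reading than a forecast that accounts for front-loaded charges:

```kql
// Straight-line burn-rate projection from the example figures.
print SpentPct = 67.0, ElapsedPct = 45.0
| extend LinearForecastPct = round(SpentPct / ElapsedPct * 100, 1)   // 148.9
| extend Status = iff(LinearForecastPct > 100, "Over budget at current burn rate", "On track")
```

If the tile shows "On track" while linear burn says otherwise, the forecast presumably reflects spend that does not recur through the month; labeling which method the tile uses avoids confusion during a shift.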

Total tiles: 5

Time to understand status: 10 seconds

Dashboard Refresh Strategy

Real-Time Data (1-minute refresh)

Critical alerts
Service health
High CPU/memory

Near Real-Time (5-minute refresh)

Failed deployments
Error logs
Network issues

Hourly Refresh

Budget status
Backup failures
Compliance alerts

Common Mistakes

❌ Mistake #1: Too Many Tiles

Problem: With 47 tiles, critical issues get lost

Fix: Maximum 10 tiles, prioritize by urgency

❌ Mistake #2: Showing "Green"

Problem: "99% of services healthy" takes up space without giving the NOC anything to act on

Fix: Only show failures. Empty dashboard = everything's fine.

❌ Mistake #3: No Business Context

Problem: "VM-SQL-12 is down" (which app is that?)

Fix: Map VMs to apps in dashboard query
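
A minimal sketch of that mapping, assuming VM liveness comes from the `Heartbeat` table and that a VM counts as down after five minutes without a heartbeat. The `AppMap` table, its rows, and the `ImpactRank` column are hypothetical; in practice the mapping might come from resource tags or a CMDB export, and `Computer` values may be fully qualified names that need normalizing first.

```kql
// Hypothetical VM-to-application mapping with a business impact rank (1 = highest).
let AppMap = datatable(Computer:string, App:string, ImpactRank:long)
[
    "VM-SQL-12", "Order database", 1,
    "VM-WEB-03", "Public storefront", 1,
    "VM-RPT-07", "Internal reporting", 3
];
Heartbeat
| where TimeGenerated > ago(1d)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
// Assumption: no heartbeat for 5 minutes means the VM is down.
| where LastHeartbeat < ago(5m)
| lookup kind=leftouter AppMap on Computer
// Unmapped VMs sort last but stay visible.
| extend App = coalesce(App, "unmapped"), ImpactRank = coalesce(ImpactRank, 99)
| project Computer, App, ImpactRank, DownFor = now() - LastHeartbeat
| order by ImpactRank asc, DownFor desc
```

With this in place the tile reads "Order database (VM-SQL-12) down for 12 minutes" instead of a bare hostname, and the most business-critical failure sits at the top.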

❌ Mistake #4: Metrics Without Baselines

Problem: "CPU is 45%" (is that normal?)

Fix: Show current vs 7-day average

The "Empty Dashboard Is Good" Philosophy

Traditional thinking: Dashboard must always show data

Better thinking: Dashboard shows PROBLEMS

Result:

Dashboard empty most of the time
When something appears, it's urgent
NOC knows exactly what to fix

Multi-Team Dashboard Strategy

Don't: One dashboard for everyone

Do: Separate dashboards per team:

NOC Dashboard

Incidents requiring immediate action
5 tiles, 10-second understanding

FinOps Dashboard

Cost trends
Budget tracking
Reservation coverage

Security Dashboard

Security alerts
Compliance violations
Vulnerability scans

Capacity Dashboard

Resource utilization
Growth trends
Forecast capacity needs

Full Dashboard Templates

Complete KQL queries, Azure Monitor Workbook templates, and multi-team dashboard architecture:

👉 Azure NOC Dashboard Complete Guide

Building dashboards for NOC teams? Show problems, not status. Empty dashboard = everything's working. That's success.
