DEV Community

David

Posted on • Originally published at azure-noob.com

Building Azure Dashboards for Cloud NOC Teams (What Actually Gets Used vs What Gets Ignored)

The Dashboard Problem

What we built: 47 tiles showing metrics, logs, alerts, compliance, costs

What NOC uses: 3 tiles

Why: Dashboard answered "what's our resource count?" not "what's broken?"

What NOC Teams Actually Need

Question #1: "What's Down Right Now?"

Not: 15 charts showing healthy services

Yes: List of failures, ranked by business impact

Question #2: "What Needs My Attention?"

Not: 200 active alerts

Yes: 5 critical alerts requiring human action
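One way to get from 200 alerts to a short list is to filter on severity and state before anything reaches the tile. A minimal sketch, assuming the alerts are Azure Monitor alerts queried through Azure Resource Graph, and assuming Sev0/Sev1 are the levels your team treats as "needs a human" (adjust to your own severity policy):

```kusto
// Azure Resource Graph query (not Log Analytics).
alertsmanagementresources
| where type =~ "microsoft.alertsmanagement/alerts"
| extend Severity  = tostring(properties.essentials.severity),
         Condition = tostring(properties.essentials.monitorCondition),
         State     = tostring(properties.essentials.alertState),
         Target    = tostring(properties.essentials.targetResourceName),
         StartTime = todatetime(properties.essentials.startDateTime)
| where Severity in ("Sev0", "Sev1")              // assumption: these levels require human action
| where Condition == "Fired" and State == "New"   // still firing and not yet acknowledged
| project AlertName = name, Severity, Target, StartTime
| order by Severity asc, StartTime asc
| take 5
```

Acknowledged or resolved alerts drop out of the result, so the tile shrinks as the team works through it.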

Question #3: "Is This Normal?"

Not: Current CPU usage

Yes: Current vs 7-day baseline with "normal range" shading

Dashboard Design That Works

Tile 1: Critical Incidents (Top Priority)

Query:

AzureActivity
| where TimeGenerated > ago(1h)
| where Level in ("Critical", "Error")
// FirstSeen feeds the "time since first occurrence" column
| summarize Count = count(), FirstSeen = min(TimeGenerated) by ResourceGroup, OperationNameValue
| order by Count desc
| take 10

Display:

  • Red alert icon
  • Resource name
  • Error count
  • Time since first occurrence
  • Business impact (if known)

Tile 2: Service Health Issues

Query:

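// Runs in Azure Resource Graph (not Log Analytics).
// Property casing follows Microsoft's sample queries; confirm against a sample event in your tenant.
ServiceHealthResources
| where type =~ "microsoft.resourcehealth/events"
| where tostring(properties.Status) == "Active"
| project Issue = tostring(properties.Title),
          EventType = tostring(properties.EventType),
          Impact = properties.Impact  // impacted services and regions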

Display:

  • Azure service name
  • Issue description
  • Affected regions
  • Link to status page

Tile 3: Failed Deployments

Query:

AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue contains "Microsoft.Resources/deployments/write"
// Some workspaces record "Failure" rather than "Failed"; match both
| where ActivityStatusValue in ("Failed", "Failure")
| project TimeGenerated, Caller, ResourceGroup, ErrorMessage = Properties

Display:

  • Who tried to deploy
  • What failed
  • Error message
  • Time

Tile 4: Abnormal Resource Consumption

Query:

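Perf
| where TimeGenerated > ago(1h)
// Restrict to the whole-machine CPU counter so per-core and per-process samples don't skew the average
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize AvgCPU = avg(CounterValue) by Computer
| where AvgCPU > 85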

Display:

  • VM name
  • Current CPU %
  • Comparison to 7-day average
  • Threshold breach time
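The query above only applies a fixed threshold; it doesn't produce the 7-day comparison. A rough sketch of how that could look, assuming the same Perf counters and treating "abnormal" as more than two standard deviations above the 7-day mean (the multiplier and the column names are illustrative choices, not a standard):

```kusto
// 7-day baseline per VM, excluding the most recent hour
let Baseline = Perf
| where TimeGenerated between (ago(7d) .. ago(1h))
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize BaselineAvg = avg(CounterValue), BaselineStdev = stdev(CounterValue) by Computer;
// Last hour, compared against that baseline
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize CurrentAvg = avg(CounterValue) by Computer
| join kind=inner Baseline on Computer
| extend UpperBound = BaselineAvg + 2 * BaselineStdev   // assumption: 2 sigma marks the top of "normal"
| where CurrentAvg > UpperBound
| project Computer, CurrentAvg, BaselineAvg, UpperBound
| order by CurrentAvg desc
```

A VM that always runs at 90% stays off the tile, while one that jumps from 20% to 60% shows up. That matches the "is this normal?" question better than a fixed 85% cutoff.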

Tile 5: Budget Alerts

Query:

AzureActivity
| where OperationNameValue contains "Microsoft.Consumption"
| where Level == "Warning" or Level == "Error"
| where TimeGenerated > ago(24h)

Display:

  • Subscription name
  • Current spend
  • Budget amount
  • Forecast end-of-month

What NOT to Include

❌ Resource Counts

Why NOC doesn't care: "We have 847 VMs" doesn't help incident response

Who cares: Capacity planning team

Where it belongs: Monthly capacity review, not NOC dashboard

❌ Compliance Metrics

Why NOC doesn't care: "72% compliant with tag policy" isn't urgent at 2 AM

Who cares: Governance team

Where it belongs: Weekly governance report

❌ Cost Breakdown Charts

Why NOC doesn't care: "Compute is 45% of spend" doesn't help fix outages

Who cares: FinOps team

Where it belongs: Monthly cost review

❌ "Healthy" Status

Why NOC doesn't care: If it's working, they don't need to see it

Better: Only show failures. If dashboard is empty, everything's fine.

Real NOC Dashboard Example

Our 5-tile dashboard:

  1. Critical Alerts (red box, top-left)

    • Currently: 0
    • If >0: Shows alert details
  2. Service Health (orange box, top-right)

    • Currently: 1 (Azure DevOps degraded, East US)
    • Impact: Low
  3. Failed Deployments (yellow box, middle-left)

    • Last 24h: 3 failures
    • Links to logs
  4. High CPU VMs (yellow box, middle-right)

    • Currently: 2 VMs over 85%
    • Shows VM names, current %
  5. Budget Status (green box, bottom)

    • 67% of monthly budget used
    • 45% of month elapsed
    • Forecast: On track

Total tiles: 5

Time to understand status: 10 seconds

Dashboard Refresh Strategy

Real-Time Data (1-minute refresh)

  • Critical alerts
  • Service health
  • High CPU/memory

Near Real-Time (5-minute refresh)

  • Failed deployments
  • Error logs
  • Network issues

Hourly Refresh

  • Budget status
  • Backup failures
  • Compliance alerts

Common Mistakes

❌ Mistake #1: Too Many Tiles

Problem: 47 tiles, can't see critical issues

Fix: Maximum 10 tiles, prioritize by urgency

❌ Mistake #2: Showing "Green"

Problem: "99% of services healthy" takes space

Fix: Only show failures. Empty dashboard = everything's fine.

❌ Mistake #3: No Business Context

Problem: "VM-SQL-12 is down" (which app is that?)

Fix: Map VMs to apps in dashboard query
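A minimal sketch of that mapping, assuming VMs report to the Heartbeat table and using a small inline lookup table. The VM names, app names, impact ranks, and the 5-minute silence threshold are all made up for illustration; in practice the mapping might come from resource tags or a CMDB export:

```kusto
// Hypothetical VM-to-application mapping (lower ImpactRank = more critical)
let AppMap = datatable(Computer:string, App:string, ImpactRank:long)
[
    "VM-SQL-12", "Order database", 1,
    "VM-WEB-03", "Internal wiki",  3
];
Heartbeat
| where TimeGenerated > ago(1d)
| summarize LastSeen = max(TimeGenerated) by Computer
| where LastSeen < ago(5m)                       // assumption: 5 minutes of silence means "down"
| lookup kind=leftouter AppMap on Computer       // Computer must match exactly (watch for FQDNs)
| extend App = coalesce(App, "unmapped"), ImpactRank = coalesce(ImpactRank, 99)
| order by ImpactRank asc, LastSeen asc
| project Computer, App, ImpactRank, LastSeen
```

The tile then reads "Order database (VM-SQL-12)" instead of a bare hostname, and sorting by ImpactRank puts the highest-impact failure at the top.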

❌ Mistake #4: Metrics Without Baselines

Problem: "CPU is 45%" (is that normal?)

Fix: Show current vs 7-day average

The "Empty Dashboard Is Good" Philosophy

Traditional thinking: Dashboard must always show data

Better thinking: Dashboard shows PROBLEMS

Result:

  • Dashboard empty most of the time
  • When something appears, it's urgent
  • NOC knows exactly what to fix

Multi-Team Dashboard Strategy

Don't: One dashboard for everyone

Do: Separate dashboards per team:

NOC Dashboard

  • Incidents requiring immediate action
  • 5 tiles, 10-second understanding

FinOps Dashboard

  • Cost trends
  • Budget tracking
  • Reservation coverage

Security Dashboard

  • Security alerts
  • Compliance violations
  • Vulnerability scans

Capacity Dashboard

  • Resource utilization
  • Growth trends
  • Forecast capacity needs

Full Dashboard Templates

Complete KQL queries, Azure Monitor Workbook templates, and multi-team dashboard architecture:

👉 Azure NOC Dashboard Complete Guide


Building dashboards for NOC teams? Show problems, not status. Empty dashboard = everything's working. That's success.
