
Originally published on TechSaaS Cloud




DORA Metrics for Platform Engineering: What Your Dashboard Should Actually Measure

Every platform engineering team has a DORA metrics dashboard. Most of them are lying.

A deployment frequency of 47/day looks great until you realize 40 of those are config changes to a feature flag service. A lead time of 2 hours looks fast until you realize it's measuring merge to deploy, not first commit to production.

Here's how to build a DORA dashboard that actually tells you something useful.

The Four Metrics (And What They Actually Mean)

1. Deployment Frequency

What people measure: COUNT(deployments) / time
What you should measure: COUNT(meaningful_deployments) / time

A meaningful deployment changes user-facing behavior. Config changes, dependency bumps, and CI fixes don't count.

# Bad: counts everything
sum(increase(deployments_total[24h]))

# Better: filter by deployment type
sum(increase(deployments_total{type="feature"}[24h]))
+ sum(increase(deployments_total{type="bugfix"}[24h]))

2. Lead Time for Changes

What people measure: Merge to deploy
What you should measure: First commit to production traffic

The time from a developer's first commit to when real users hit the new code. This captures code review wait time, CI queue time, staging validation, and rollout duration — all the friction your platform creates.

# Capture the full pipeline
histogram_quantile(0.50,
  sum(rate(lead_time_seconds_bucket{
    stage="first_commit_to_production"
  }[7d])) by (le)
)

3. Change Failure Rate

What people measure: failed_deploys / total_deploys
What you should measure: deploys_causing_degradation / total_deploys

A deployment that fails CI and never reaches production isn't a change failure — it's CI working correctly. A deployment that passes everything but causes a 10% error rate spike IS a change failure.
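
Here's a minimal sketch of that distinction, assuming each production deployment record carries a flag for whether it caused a user-visible degradation (the field names are illustrative, not from any specific tool):

from dataclasses import dataclass

@dataclass
class DeployRecord:
    reached_production: bool   # CI rejections never get this far
    caused_degradation: bool   # error or latency regression attributed to this deploy

def change_failure_rate(records: list[DeployRecord]) -> float:
    """Divide degradation-causing deploys by production deploys, not by CI attempts."""
    prod = [r for r in records if r.reached_production]
    return sum(r.caused_degradation for r in prod) / len(prod) if prod else 0.0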

4. Mean Time to Recovery (MTTR)

What people measure: Time from alert to resolution
What you should measure: Time from user impact to user recovery

If your alerting has 15 minutes of lag, your MTTR looks 15 minutes better than reality. Measure from the moment error rates spike, not from when PagerDuty fires.
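
As a rough illustration of how much alert lag flatters the number, here's a sketch with hypothetical timestamps:

from datetime import datetime

def mttr_minutes(start: datetime, recovered_at: datetime) -> float:
    # Recovery time measured from whichever start point you choose.
    return (recovered_at - start).total_seconds() / 60

impact_start = datetime(2025, 1, 6, 10, 0)   # error rate spikes
alert_fired  = datetime(2025, 1, 6, 10, 15)  # PagerDuty pages someone
recovered    = datetime(2025, 1, 6, 11, 0)   # error rate back to baseline

print(mttr_minutes(impact_start, recovered))  # 60.0 -- the real MTTR
print(mttr_minutes(alert_fired, recovered))   # 45.0 -- flattered by 15 minutes of alert lag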

The Dashboard That Works

Panel 1: Weekly Deployment Velocity

  • Line chart: deployments per week, split by type (feature, bugfix, infra)
  • Exclude: config changes, dependency updates, CI fixes
  • Annotation: mark release freezes, incidents, holidays

Panel 2: Lead Time Distribution

  • Heatmap: lead time buckets (hours) over past 30 days
  • Show p50, p75, p95 — not just average
  • Split by team if multi-team org

Panel 3: Change Failure Rate Trend

  • Stacked bar: successful deploys vs. failure-causing deploys per week
  • Overlay: change failure rate as percentage line
  • Alert threshold at 15% (DORA "high" performer benchmark)

Panel 4: MTTR by Severity

  • Bar chart: average MTTR split by incident severity (SEV1-4)
  • Include: detection time, triage time, fix time, verification time
  • Goal lines: SEV1 < 1hr, SEV2 < 4hr, SEV3 < 24hr

Panel 5: Platform Health Score

Composite metric combining all four DORA metrics into a single score:

score = (
  normalize(deployment_freq, target=daily) * 0.25 +
  normalize(1/lead_time_hours, target=24h) * 0.25 +
  normalize(1-change_failure_rate, target=0.85) * 0.25 +
  normalize(1/mttr_hours, target=1h) * 0.25
)
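
normalize() is left undefined above; one way to read it, assuming each metric is scaled against its target and capped at 1.0 so the score lands between 0 and 1, is:

def normalize(value: float, target: float) -> float:
    # Hitting the target scores 1.0; anything beyond it doesn't add more credit.
    return min(value / target, 1.0) if target > 0 else 0.0

def platform_health_score(deploys_per_day: float, lead_time_hours: float,
                          change_failure_rate: float, mttr_hours: float) -> float:
    return (
        normalize(deploys_per_day, target=1.0) * 0.25                    # target: daily deploys
        + normalize(24 / max(lead_time_hours, 0.1), target=1.0) * 0.25   # target: lead time under 24h
        + normalize(1 - change_failure_rate, target=0.85) * 0.25         # target: CFR below 15%
        + normalize(1 / max(mttr_hours, 0.1), target=1.0) * 0.25         # target: recovery under 1h
    )

# e.g. 0.5 deploys/day, 48h lead time, 20% CFR, 2h MTTR -> roughly 0.61
print(platform_health_score(0.5, 48, 0.20, 2))

Whether you cap at the target or let over-performance raise the score is a judgment call; capping keeps one elite metric from hiding a weak one.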

Common Anti-Patterns

1. Gaming the Metrics

Teams split PRs into tiny changes to inflate deployment frequency. Fix: measure feature completion rate alongside deployment frequency.

2. Measuring Teams Against Each Other

DORA metrics are for teams to improve themselves, not for management to rank teams. Different services have legitimately different deployment profiles.

3. Ignoring Context

A team with 0 deployments during a security incident investigation isn't underperforming — they're doing the right thing. Always annotate metric dashboards with context.

4. Snapshot Obsession

Looking at this week's numbers in isolation tells you nothing. The trend over 3-6 months is what matters.

Implementation: Data Sources for Real DORA

The metrics above are only as good as the data feeding them. Here's where to get each metric:

Deployment Frequency:

  • Source: CI/CD pipeline events (GitHub Actions webhook, ArgoCD notifications, Flux alerts)
  • Label each deployment with type: feature, bugfix, config, dependency, infra
  • Push to Prometheus via pushgateway or use a deployment tracker service (a minimal sketch follows below)
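
For the pushgateway route, a minimal sketch with the official prometheus_client library (the gateway address, job name, and labels are placeholders for whatever your pipeline uses):

# Run as a post-deploy step in the CI/CD job.
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

registry = CollectorRegistry()
deployments = Counter(
    "deployments_total", "Production deployments by type",
    ["type", "service"], registry=registry,
)

# Label the event so the dashboard can filter out config and dependency noise.
deployments.labels(type="feature", service="checkout").inc()

push_to_gateway(
    "pushgateway.monitoring:9091", job="deploy-tracker",
    grouping_key={"service": "checkout"}, registry=registry,
)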

Lead Time:

  • Source: Git events (first commit timestamp) + deployment events (production rollout timestamp)
  • Calculate: production_deploy_time - first_commit_time for each PR/branch
  • Tools: LinearB, Sleuth, or a custom webhook that tracks PR lifecycle (see the sketch after this list)
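
If you run your own webhook, the calculation itself is a timestamp subtraction; here's a sketch that also feeds the lead_time_seconds histogram queried earlier (the bucket boundaries are illustrative):

from datetime import datetime, timezone
from prometheus_client import Histogram

# Buckets from 10 minutes out to a week, in seconds.
LEAD_TIME = Histogram(
    "lead_time_seconds", "First commit to production traffic", ["stage"],
    buckets=[600, 3600, 4 * 3600, 24 * 3600, 3 * 24 * 3600, 7 * 24 * 3600],
)

def record_lead_time(first_commit: datetime, production_rollout: datetime) -> float:
    # Covers review wait, CI queue, staging, and rollout, not just merge to deploy.
    seconds = (production_rollout - first_commit).total_seconds()
    LEAD_TIME.labels(stage="first_commit_to_production").observe(seconds)
    return seconds

record_lead_time(
    datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc),    # first commit on the branch
    datetime(2025, 1, 8, 14, 30, tzinfo=timezone.utc),  # rollout reaches production traffic
)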

Change Failure Rate:

  • Source: Incident tracking (PagerDuty, Opsgenie) correlated with deployment events
  • Logic: if incident starts within 1 hour of deployment AND affects the deployed service, count as change failure
  • This correlation is the hardest part; most teams get it wrong because they don't link incidents back to deploys (a rough sketch of the time-window join follows)
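
Here's that join, assuming you can pull deployment and incident records with timestamps and a service name (the structures are illustrative):

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deployment:
    service: str
    deployed_at: datetime

@dataclass
class Incident:
    service: str
    started_at: datetime

ATTRIBUTION_WINDOW = timedelta(hours=1)  # policy choice, per the rule above

def is_change_failure(deploy: Deployment, incidents: list[Incident]) -> bool:
    # An incident on the same service starting within the window after rollout
    # is attributed to this deploy.
    return any(
        inc.service == deploy.service
        and deploy.deployed_at <= inc.started_at <= deploy.deployed_at + ATTRIBUTION_WINDOW
        for inc in incidents
    )

A fixed one-hour window will miss slow burns and mislabel unrelated incidents on busy services, so treat it as a starting point, not ground truth.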

MTTR:

  • Source: Monitoring (Prometheus alertmanager) for impact start, incident tracker for resolution
  • Measure from first error rate spike (detected by anomaly detection), not from alert firing
  • Include: detection lag, triage time, fix time, verification time as separate sub-metrics (broken out in the sketch below)
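
A sketch of that breakdown, assuming the incident tracker exposes these timestamps (the names are illustrative):

from dataclasses import dataclass
from datetime import datetime

@dataclass
class IncidentTimeline:
    impact_start: datetime   # first error-rate spike, from anomaly detection
    alert_fired: datetime    # when the pager actually went off
    triage_done: datetime    # owner identified, fix in progress
    fix_deployed: datetime
    verified: datetime       # error rate back to baseline

    def sub_metrics(self) -> dict[str, float]:
        def minutes(a: datetime, b: datetime) -> float:
            return (b - a).total_seconds() / 60
        return {
            "detection_lag": minutes(self.impact_start, self.alert_fired),
            "triage": minutes(self.alert_fired, self.triage_done),
            "fix": minutes(self.triage_done, self.fix_deployed),
            "verification": minutes(self.fix_deployed, self.verified),
            "mttr": minutes(self.impact_start, self.verified),
        }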

SPACE Framework: Beyond DORA

DORA measures delivery performance. SPACE (from Microsoft Research) adds developer experience:

  • Satisfaction and well-being (quarterly survey, eNPS score)
  • Performance (DORA metrics as described above)
  • Activity (commits, PRs, reviews — use carefully, never as productivity proxy)
  • Communication and collaboration (PR review turnaround, async response time)
  • Efficiency and flow (focus time from calendar analysis, context switches from tool telemetry)

The combination of DORA (system performance) + SPACE (human experience) gives you the full picture. A team with elite DORA metrics but 30% satisfaction is one resignation away from collapse.

Our Recommendation

Start with just two metrics: deployment frequency (filtered by type) and change failure rate. These are the easiest to instrument and the most actionable. Add lead time once you have the data pipeline working. Add MTTR when you have incident tracking mature enough to correlate with deploys.

The dashboard is not the goal. The goal is a team that ships faster with fewer failures. The dashboard just makes the trend visible so you can have evidence-based conversations about where to invest in your platform.


Want help building a DORA metrics dashboard that actually drives improvement? Book a free platform engineering consultation or explore our DevOps services.
