137Foundry

Posted on Jun 5

Setting Up DORA Metrics in GitHub Actions Without Buying Anything


DORA metrics (Deployment Frequency, Lead Time for Changes, Mean Time to Restore Service, and Change Failure Rate) are the most reliably validated software delivery metrics available, and they can be measured for a team using only GitHub Actions, a single SQLite database, and some bookkeeping. There is no need to buy a commercial DevOps analytics platform for a team that wants to start measuring these metrics.

This is a practical walkthrough of how to instrument the four metrics for a project hosted on GitHub, with explicit notes on the gotchas that show up when the instrumentation goes from "working on one service" to "working across an organization."


Step 1: Define What "Deploy" Means

The first decision is what counts as a deploy. The answer should be unambiguous and consistent across the team.

The most common conventions:

A successful merge to the main branch counts as a deploy, because main is wired to auto-deploy to production.
A successful workflow run on a release tag counts as a deploy.
A successful run of a specific deployment workflow (one that ships to production, not staging) counts as a deploy.

Pick one and commit to it. The metric is meaningless if "deploy" sometimes means one thing and sometimes another. The GitHub Actions documentation covers workflow triggers in detail, which is the reference to read when codifying the choice.

For a team with multiple services in a monorepo, each service has its own definition. A merge to main triggers different workflows for different services, and the deployment metric is per-service.
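
One way to keep the definition from drifting is to encode it as a single predicate that every later step reuses. The sketch below assumes the third convention (a dedicated production deployment workflow). The workflow name and branch are hypothetical placeholders; the dictionary keys are the ones the GitHub REST API returns on workflow run objects.

```python
# Hypothetical values: replace with the team's own workflow name and branch.
DEPLOY_WORKFLOW_NAME = "deploy-production"
DEPLOY_BRANCH = "main"


def is_production_deploy(run: dict) -> bool:
    """Return True if a workflow run (as returned by the REST API) counts as a deploy."""
    return (
        run.get("name") == DEPLOY_WORKFLOW_NAME
        and run.get("head_branch") == DEPLOY_BRANCH
        and run.get("status") == "completed"
        and run.get("conclusion") == "success"
    )
```

For a monorepo, the same idea extends to one predicate (or one workflow name) per service, so that the per-service definitions live in one reviewable place.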

Step 2: Capture Deployment Frequency

GitHub Actions records every workflow run in the workflow run history, accessible through the GitHub API. To capture deployment frequency, query the workflow run history for the deployment workflow and count successful completions over a rolling window (typically 30 days).

The minimal implementation:

Set up a scheduled GitHub Actions workflow that runs daily.
The workflow calls the GitHub REST API to fetch workflow runs for the deployment workflow in the last 30 days.
Filter to successful runs with the deploy criteria from Step 1.
Count them, divide by 30, and store the resulting "deploys per day" figure in a database (SQLite, Postgres, or whatever the team uses for ops data).

The query returns paginated results (up to 100 runs per page), so the daily job needs to handle pagination. Authenticated requests with a personal access token are limited to 5,000 per hour, and GitHub App installation tokens can get higher limits, so rate limiting is rarely a constraint at the daily-query cadence.
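
The daily job could look roughly like the sketch below, which uses only the Python standard library. It assumes the deployment workflow is identified by its file name (a hypothetical `deploy.yml`, for example), that the token is supplied by the caller, and that a run's `updated_at` is a good enough proxy for when the deploy finished. The function names are illustrative, not part of any library.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone


def parse_github_time(value):
    """Parse a GitHub API timestamp such as '2024-06-03T12:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def fetch_successful_runs(owner, repo, workflow_file, token, since):
    """Fetch successful runs of one workflow created since `since`, page by page."""
    runs, page = [], 1
    while True:
        query = urllib.parse.urlencode({
            "status": "success",
            "created": f">={since:%Y-%m-%d}",
            "per_page": 100,
            "page": page,
        })
        url = (f"https://api.github.com/repos/{owner}/{repo}"
               f"/actions/workflows/{workflow_file}/runs?{query}")
        request = urllib.request.Request(url, headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        })
        with urllib.request.urlopen(request) as response:
            batch = json.load(response)["workflow_runs"]
        runs.extend(batch)
        if len(batch) < 100:  # a short page means there is nothing left to fetch
            return runs
        page += 1


def deploys_per_day(runs, now, window_days=30):
    """Count successful runs that finished inside the window, divided by its length."""
    cutoff = now - timedelta(days=window_days)
    count = sum(
        1 for run in runs
        if run["conclusion"] == "success"
        and parse_github_time(run["updated_at"]) >= cutoff
    )
    return count / window_days
```

Keeping the counting logic separate from the HTTP call makes it possible to test the metric against recorded API responses without touching the network.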

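For a starting threshold, the DORA research published at dora.dev groups teams into deployment frequency clusters: low performers deploy fewer than once per month, medium 1-7 times per week, high 1-7 times per day, and elite on demand, multiple times per day. The cluster boundaries have shifted between annual reports, so confirm them against the current report before setting targets.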

Step 3: Capture Lead Time for Changes

Lead time for changes is the time from commit authorship to the commit reaching production. This requires linking each deploy to the commits that were included in it.

GitHub Actions provides the deploying commit SHA in the workflow context (github.sha), and the GitHub API exposes the commit metadata (author timestamp, committer timestamp, parents). The instrumentation:

At the end of each successful deployment workflow run, record the SHA being deployed and the deploy completion timestamp.
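Use the GitHub API's "compare two commits" endpoint (GET /repos/{owner}/{repo}/compare/{base}...{head}), with the previous deploy's SHA as the base and the current deploy's SHA as the head, to find all commits included in this deploy that were not in the previous deploy.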
For each commit, the lead time is deploy_completion_timestamp - commit_author_timestamp.
Store the per-commit lead times. Aggregate to median and 90th percentile over a rolling window.
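
The steps above can be sketched as follows. The code assumes the caller has already fetched the compare response and passes in its `commits` array, where each entry carries `sha` and `commit.author.date`. The function names are hypothetical, and a deploy with a very large number of commits would also need the compare endpoint's pagination, which is left out here.

```python
from datetime import datetime, timezone


def parse_github_time(value):
    """Parse a GitHub API timestamp such as '2024-06-03T12:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def compare_url(owner, repo, previous_sha, deployed_sha):
    """Build the URL that lists commits in this deploy but not in the previous one."""
    return (f"https://api.github.com/repos/{owner}/{repo}"
            f"/compare/{previous_sha}...{deployed_sha}")


def lead_times_seconds(compare_commits, deployed_at):
    """Map each commit SHA to deploy completion time minus author time, in seconds."""
    result = {}
    for item in compare_commits:
        authored = parse_github_time(item["commit"]["author"]["date"])
        result[item["sha"]] = int((deployed_at - authored).total_seconds())
    return result
```

Each entry of the returned mapping corresponds to one row of per-commit lead time data, keyed by commit SHA.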

The 90th percentile is usually more interesting than the median because it reveals the long-tail commits that took unusually long to ship, which is where most lead time pain lives. The median can be misleading on its own if the typical commit is fast but a meaningful subset takes weeks.
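
A small worked example makes the gap between the two aggregates concrete. The lead times below are made-up numbers in hours, and the percentile uses the nearest-rank method, which is one of several reasonable definitions rather than the only one.

```python
import statistics


def nearest_rank_percentile(values, pct):
    """Return the smallest value such that at least pct percent of values are <= it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


# Illustrative lead times in hours: most commits ship fast, two sat for days.
lead_times_hours = [2, 3, 3, 4, 4, 5, 5, 6, 200, 400]

median_hours = statistics.median(lead_times_hours)         # 4.5
p90_hours = nearest_rank_percentile(lead_times_hours, 90)  # 200
```

Here the median says commits ship in under five hours, while the 90th percentile shows that one commit in ten waits more than a week.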

Step 4: Capture Change Failure Rate

Change failure rate is the percentage of deploys that result in a user-impacting failure. The challenge is defining what counts as a failure.

A practical definition: a deploy is a "failure" if it required a rollback, required a hotfix deploy within 24 hours to address a bug it introduced, or generated an incident ticket at or above a team-chosen severity threshold within 24 hours of the deploy.

The mechanics:

Record rollback workflow runs as rollbacks and mark the deploy they revert as a failure.
Record hotfix workflow runs together with the SHA of the failed deploy they fix.
Integrate with the incident tracker (PagerDuty, Opsgenie, Linear, Jira, or whatever the team uses) to flag deploys correlated with new incidents.

The integration with the incident tracker is the part most teams under-invest in. Even a manual flagging step (a checkbox on the incident postmortem that says "did a deploy in the last 24 hours contribute to this?") provides enough signal to make the metric meaningful.
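
The correlation logic can be sketched in a few lines once deploys and incidents are available as plain records. The record shapes below (`rolled_back`, `hotfixed`, `service`, `created_at`) are hypothetical and stand in for whatever the team's own bookkeeping stores; the 24-hour window comes from the definition above.

```python
from datetime import datetime, timedelta

FAILURE_WINDOW = timedelta(hours=24)


def failed_deploy_ids(deploys, incidents, window=FAILURE_WINDOW):
    """Return ids of deploys that were rolled back, hotfixed, or followed by an incident."""
    failed = set()
    for deploy in deploys:
        if deploy.get("rolled_back") or deploy.get("hotfixed"):
            failed.add(deploy["id"])
            continue
        window_end = deploy["deployed_at"] + window
        for incident in incidents:
            same_service = incident["service"] == deploy["service"]
            in_window = deploy["deployed_at"] <= incident["created_at"] <= window_end
            if same_service and in_window:
                failed.add(deploy["id"])
                break
    return failed


def change_failure_rate(deploys, incidents):
    """Fraction of deploys flagged as failures; 0.0 when there are no deploys."""
    if not deploys:
        return 0.0
    return len(failed_deploy_ids(deploys, incidents)) / len(deploys)
```

One design choice to be aware of: when several deploys of the same service land inside one 24-hour window, this sketch flags all of them for a single incident. A manual postmortem checkbox is a reasonable way to narrow that down to the deploy actually responsible.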

DORA's classification thresholds for change failure rate are: low performers 46-60 percent, medium 16-30 percent, high 16-30 percent (yes, medium and high share a band in the report these figures come from), elite 0-15 percent.

Step 5: Capture Mean Time to Restore Service

MTTR is the time from a user-impacting incident being detected to the incident being resolved. This requires data from the incident tracker more than from CI/CD.

For teams using PagerDuty, the PagerDuty API documentation covers the endpoints needed to fetch incident timestamps. The instrumentation:

Query the incident system daily for incidents resolved in the last 30 days.
For each incident, the time to restore is resolved_at - created_at, treating the incident's creation time as the moment it was detected.
Aggregate to median and 90th percentile.
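
As a rough sketch, the aggregation could look like this. It assumes incidents have already been fetched and reduced to dictionaries with ISO 8601 `created_at` and `resolved_at` strings (the field names used above); the function names are hypothetical, and the 90th percentile uses the nearest-rank method.

```python
import statistics
from datetime import datetime


def parse_time(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def restore_times_seconds(incidents):
    """Return resolved_at minus created_at for every resolved incident."""
    durations = []
    for incident in incidents:
        if not incident.get("resolved_at"):
            continue  # still open: no restore time yet
        delta = parse_time(incident["resolved_at"]) - parse_time(incident["created_at"])
        durations.append(int(delta.total_seconds()))
    return durations


def summarize(durations):
    """Return the median and nearest-rank 90th percentile, or None if there is no data."""
    if not durations:
        return None
    ordered = sorted(durations)
    rank = (90 * len(ordered) + 99) // 100  # integer ceiling of 0.9 * n
    return {"median": statistics.median(ordered), "p90": ordered[rank - 1]}
```

Open incidents are skipped rather than counted as zero, so a long-running outage does not make the metric look better while it is still in progress.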

DORA categorizes MTTR clusters: low performers more than 6 months, medium 1 day to 1 week, high less than 1 day, elite less than 1 hour.

Step 6: Persist and Visualize

The four metrics need to be persisted to a data store that supports time-series queries. SQLite works for a single team starting out; Postgres or a managed time-series database scales further if needed.

A minimal schema:

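-- One row per production deploy, as defined in Step 1.
CREATE TABLE deploys (
  id TEXT PRIMARY KEY,
  service TEXT NOT NULL,
  sha TEXT NOT NULL,
  deployed_at DATETIME NOT NULL,  -- deploy completion time, in UTC
  workflow_run_id TEXT,
  failed BOOLEAN DEFAULT FALSE    -- set when a rollback, hotfix, or incident is linked
);

-- One row per commit shipped in a deploy; feeds lead time for changes.
CREATE TABLE commits_in_deploys (
  deploy_id TEXT NOT NULL REFERENCES deploys(id),
  commit_sha TEXT NOT NULL,
  author_timestamp DATETIME NOT NULL,
  lead_time_seconds INTEGER,      -- deployed_at minus author_timestamp
  PRIMARY KEY (deploy_id, commit_sha)
);

-- One row per incident; feeds time to restore.
CREATE TABLE incidents (
  id TEXT PRIMARY KEY,
  service TEXT,
  detected_at DATETIME,
  resolved_at DATETIME,           -- NULL while the incident is still open
  mttr_seconds INTEGER            -- resolved_at minus detected_at
);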

Visualizations can use any dashboarding tool that connects to the database. Grafana works well for time-series and is free for self-hosting. A static page generated nightly from the database is even simpler and easier to share inside the team.
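
Whichever dashboard is used, the underlying queries are short. The sketch below uses Python's built-in sqlite3 module with an in-memory database and a `deploys` table shaped like the one above (with `DEFAULT 0` standing in for `DEFAULT FALSE`). The function name, the sample rows, and the "YYYY-MM-DD HH:MM:SS" text format for timestamps are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE deploys (
  id TEXT PRIMARY KEY,
  service TEXT,
  sha TEXT,
  deployed_at DATETIME,
  workflow_run_id TEXT,
  failed BOOLEAN DEFAULT 0
);
"""

SUMMARY_QUERY = """
SELECT service,
       COUNT(*) AS deploys,
       COUNT(*) / 30.0 AS deploys_per_day,
       AVG(failed) AS change_failure_rate
FROM deploys
WHERE deployed_at >= datetime(?, '-30 days')
GROUP BY service
ORDER BY service;
"""


def summarize_deploys(conn, as_of):
    """Return (service, deploys, deploys_per_day, change_failure_rate) per service."""
    return conn.execute(SUMMARY_QUERY, (as_of,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO deploys (id, service, sha, deployed_at, failed) VALUES (?, ?, ?, ?, ?)",
    [
        ("d1", "api", "a1", "2024-06-05 10:00:00", 0),
        ("d2", "api", "a2", "2024-06-12 10:00:00", 1),
        ("d3", "api", "a3", "2024-06-20 10:00:00", 0),
        ("d4", "api", "a0", "2024-04-01 10:00:00", 0),  # outside the 30-day window
        ("d5", "web", "w1", "2024-06-18 10:00:00", 0),
    ],
)
rows = summarize_deploys(conn, "2024-06-30 00:00:00")
```

With these sample rows, `api` shows three deploys in the window with one failure and `web` shows one deploy with none. Grouping by service in the query itself keeps the per-service view as the default rather than an afterthought.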

Common Pitfalls

A few patterns that catch teams setting this up for the first time:

Counting staging deploys. Only production deploys count. Staging deploys are not user-facing and including them inflates the deployment frequency number without describing real delivery cadence.

Counting failed workflow runs. Only successful runs that actually shipped count. A workflow run that failed during the build step did not deploy anything.

Treating multiple-PR deploys as single-commit lead time. When ten commits ship in one deploy, each commit's lead time is from its own authorship to the deploy. Using the deploy's own start time instead of the commit's authorship time understates lead time significantly.

Ignoring per-service variation. Aggregating across an entire monorepo hides slow services behind fast ones. Per-service measurement is more work but produces more actionable data.

Forgetting about weekends and holidays. Lead time can spike during quiet periods because commits sit waiting for the next deploy. Calculating "business hours only" lead time is more honest for teams with weekend-quiet deploy cadences.
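
A minimal sketch of a "business hours only" lead time is shown below. It assumes a Monday-to-Friday week with a 09:00 to 17:00 working day (both hypothetical, adjust to the team), takes naive datetimes already converted to the team's local time, and does not account for public holidays.

```python
from datetime import datetime, time, timedelta

# Hypothetical working hours; change these to match the team.
WORKDAY_START = time(9, 0)
WORKDAY_END = time(17, 0)


def business_seconds(start, end):
    """Count the seconds between start and end that fall inside weekday working hours."""
    if end <= start:
        return 0
    total = 0.0
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # Monday is 0, Friday is 4
            window_start = max(start, datetime.combine(day, WORKDAY_START))
            window_end = min(end, datetime.combine(day, WORKDAY_END))
            if window_end > window_start:
                total += (window_end - window_start).total_seconds()
        day += timedelta(days=1)
    return int(total)


# A commit authored Friday at 16:00 and deployed Monday at 10:00.
authored = datetime(2024, 6, 7, 16, 0)
deployed = datetime(2024, 6, 10, 10, 0)
wall_clock = int((deployed - authored).total_seconds())  # 237600 seconds (66 hours)
business = business_seconds(authored, deployed)          # 7200 seconds (2 hours)
```

Storing both figures side by side is usually more useful than replacing one with the other, since the wall-clock number is the one that matches the DORA definition.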


What 137Foundry Builds Around This

The instrumentation above is sufficient for a single team to get started, but organizations with multiple teams, multiple deployment platforms, and complex incident workflows typically want more sophisticated integration. The data integration between CI/CD systems and incident trackers is one of the areas covered by https://137foundry.com when teams want a unified view across services and want to surface DORA metrics in a way leadership will actually engage with.

The longer reference on what to do with the metrics once instrumented is How to Establish Engineering Productivity Metrics That Drive Real Improvements on the 137Foundry blog. It covers the framework for choosing metrics, the role of qualitative data alongside the four DORA metrics, and the failure modes of common alternatives.

The full 137Foundry services hub covers the engagements that come up around this work, including the data integration and internal developer platform work that supports reliable DORA measurement.

The point of measuring is to act on the result. Teams that set up DORA metrics and watch them stay flat are typically not investing in the underlying delivery system; the metric is a thermometer, and the fever does not break unless someone takes medicine. The first month of measurement is for establishing the baseline. The investment that follows is where the metric earns its place on the dashboard.