DEV Community

Yash


The DevOps metrics that actually matter (and how to track them)


The 4 DORA metrics

Metric                    Elite               Low
Deployment Frequency      Multiple per day    Less than once per month
Lead Time for Changes     Under 1 hour        Over 1 month
Change Failure Rate       0-5%                15-30%
MTTR (time to restore)    Under 1 hour        Over 1 week
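To see where your own numbers land, the Elite and Low bands in the table can be applied mechanically. This is a minimal sketch, assuming deployment frequency is measured in deploys per day, lead time and MTTR in seconds, and change failure rate as a fraction. The function name `classify`, the 30-day month, and reading "multiple per day" as two or more are illustrative choices; anything that falls in neither band is reported as "between".

```python
HOUR = 3600.0
DAY = 24 * HOUR


def classify(metric: str, value: float) -> str:
    """Place a measured value on the Elite/Low bands from the DORA table.

    Assumed units: deployment_frequency in deploys per day, lead_time and
    mttr in seconds, change_failure_rate as a fraction (0.05 means 5%).
    """
    if metric == "deployment_frequency":
        # "Multiple per day" read as 2 or more; "less than once per month" with a 30-day month
        elite, low = value >= 2, value < 1 / 30
    elif metric == "lead_time":
        elite, low = value < HOUR, value > 30 * DAY
    elif metric == "change_failure_rate":
        # The table's Low band stops at 30%; anything at or above 15% is treated as Low here
        elite, low = value <= 0.05, value >= 0.15
    elif metric == "mttr":
        elite, low = value < HOUR, value > 7 * DAY
    else:
        raise ValueError(f"unknown metric: {metric}")
    return "Elite" if elite else "Low" if low else "between"
```

For example, `classify("mttr", 45 * 60)` returns `"Elite"`, while a 10% change failure rate lands in `"between"`.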

Track deployment frequency in GitHub Actions

- name: Record deployment metric
  if: success()
  run: |
    aws cloudwatch put-metric-data \
      --namespace "DevOps/Deployments" \
      --metric-name "DeploymentCount" \
      --value 1 \
      --dimensions Service=${{ env.SERVICE_NAME }},Environment=production
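Reading the data back is a matter of summing `DeploymentCount` per day. A minimal sketch using boto3's CloudWatch `get_metric_statistics`, assuming the same namespace and dimensions as the step above; `deployments_per_day` is a hypothetical helper, and the client is passed in so it can be replaced by a stub in tests.

```python
from datetime import datetime, timedelta, timezone


def deployments_per_day(cloudwatch, service: str, days: int = 30) -> float:
    """Average number of production deploys per day over the last `days` days."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="DevOps/Deployments",
        MetricName="DeploymentCount",
        # Dimensions must match exactly what the workflow step publishes
        Dimensions=[
            {"Name": "Service", "Value": service},
            {"Name": "Environment", "Value": "production"},
        ],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=86400,  # one datapoint per day
        Statistics=["Sum"],
    )
    total = sum(dp["Sum"] for dp in resp["Datapoints"])
    # Days with no deploy return no datapoint, so divide by the window length,
    # not by len(Datapoints)
    return total / days
```

Usage would look like `deployments_per_day(boto3.client("cloudwatch"), "checkout-api")`, where `"checkout-api"` stands in for whatever `SERVICE_NAME` your workflow sets.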

Track lead time (commit → production)

- name: Record lead time
  if: success()
  run: |
    # Lead time of the head commit only: committer timestamp to now, in seconds
    COMMIT_TIME=$(git show -s --format=%ct ${{ github.sha }})
    LEAD_TIME=$(( $(date +%s) - COMMIT_TIME ))
    aws cloudwatch put-metric-data \
      --namespace "DevOps/Deployments" \
      --metric-name "LeadTimeSeconds" \
      --unit Seconds \
      --value "$LEAD_TIME" \
      --dimensions Service=${{ env.SERVICE_NAME }}
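The step above measures only the head commit (`github.sha`), so a deploy that ships a batch of commits reports the newest, shortest lead time. A sketch that covers every commit in the deploy, assuming you know the previously deployed SHA (`prev_sha` is hypothetical; how you store it is up to you) and that the checkout has enough history for the range, for example `actions/checkout` with `fetch-depth: 0`:

```python
import statistics
import subprocess


def commit_times(prev_sha: str, sha: str) -> list[int]:
    """Committer timestamps (Unix seconds) of every commit in prev_sha..sha."""
    out = subprocess.run(
        ["git", "log", "--format=%ct", f"{prev_sha}..{sha}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [int(line) for line in out.split()]


def lead_time_stats(times: list[int], deployed_at: int) -> dict:
    """Per-commit lead time in seconds, summarised as median and max."""
    if not times:
        raise ValueError("no commits in range")
    leads = [deployed_at - t for t in times]
    return {"commits": len(leads), "median": statistics.median(leads), "max": max(leads)}
```

Inside the job this could be called as `lead_time_stats(commit_times(prev_sha, os.environ["GITHUB_SHA"]), int(time.time()))`, and the median published as `LeadTimeSeconds` instead of the head-commit value. The median keeps one long-lived branch from dominating the number.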

Track MTTR via alarm state changes (Lambda)

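import boto3
from datetime import datetime

ssm = boto3.client('ssm')
cw  = boto3.client('cloudwatch')

def handler(event, context):
    # Invoked by EventBridge for each "CloudWatch Alarm State Change" event
    alarm = event['detail']['alarmName']
    state = event['detail']['state']['value']
    ts    = datetime.fromisoformat(event['time'].replace('Z', '+00:00'))
    param = f'/incidents/{alarm}/start'

    if state == 'ALARM':
        # Record when the incident started
        ssm.put_parameter(Name=param, Value=ts.isoformat(), Type='String', Overwrite=True)
    elif state == 'OK':
        try:
            start = ssm.get_parameter(Name=param)
        except ssm.exceptions.ParameterNotFound:
            return  # no recorded start, e.g. the alarm fired before this Lambda existed
        mttr = (ts - datetime.fromisoformat(start['Parameter']['Value'])).total_seconds()
        cw.put_metric_data(Namespace='DevOps/Incidents',
                           MetricData=[{'MetricName': 'MTTR', 'Value': mttr, 'Unit': 'Seconds'}])
        # Delete the marker so a later OK without a new ALARM is not counted twice
        ssm.delete_parameter(Name=param)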
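The Lambda only runs if something routes alarm events to it. A sketch of that wiring with boto3's EventBridge client, assuming the rule name, target `Id`, and optional alarm-name prefix shown here (all hypothetical). The pattern matches the `CloudWatch Alarm State Change` events the handler reads, limited to the ALARM and OK states it acts on.

```python
import json


def alarm_state_change_pattern(alarm_prefix: str = "") -> dict:
    """EventBridge pattern matching CloudWatch alarms entering ALARM or OK."""
    detail = {"state": {"value": ["ALARM", "OK"]}}
    if alarm_prefix:
        # Optional: only route alarms whose names start with this prefix
        detail["alarmName"] = [{"prefix": alarm_prefix}]
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": detail,
    }


def wire_rule(events, rule_name: str, lambda_arn: str, alarm_prefix: str = "") -> None:
    """Create or update the rule and point it at the MTTR Lambda."""
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(alarm_state_change_pattern(alarm_prefix)),
        State="ENABLED",
    )
    events.put_targets(Rule=rule_name, Targets=[{"Id": "mttr-lambda", "Arn": lambda_arn}])
```

EventBridge also needs permission to invoke the function (a resource-based policy on the Lambda for the `events.amazonaws.com` principal, for example via Lambda's `add_permission`), and the function's execution role needs `ssm:PutParameter`, `ssm:GetParameter`, `ssm:DeleteParameter`, and `cloudwatch:PutMetricData`. Because the handler emits one `MTTR` datapoint per incident, the CloudWatch `Average` statistic over a period gives the mean time to restore for that period.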

Leading indicators

  • Alert fatigue rate: when it rises, alert signal quality is degrading
  • Deployment size: larger deployments tend to fail more often
  • Test coverage trend: a declining trend points to more failures later
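These indicators are most useful as trends rather than single readings. A sketch that fits a least-squares slope to an evenly spaced series (say, one value per week) and flags each indicator that is moving in its bad direction. The series names, and reading alert fatigue rate as something like the share of alerts nobody acted on, are assumptions for illustration.

```python
def slope(values: list[float]) -> float:
    """Least-squares slope per period of an evenly spaced series."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Direction that signals trouble for each indicator (+1: rising is bad, -1: falling is bad)
BAD_DIRECTION = {"alert_fatigue_rate": +1, "deployment_size": +1, "test_coverage": -1}


def warnings(series: dict[str, list[float]]) -> list[str]:
    """Names of indicators whose trend points in their bad direction."""
    return [name for name, values in series.items()
            if slope(values) * BAD_DIRECTION[name] > 0]
```

A slope fitted over several weeks is less sensitive to one noisy week than comparing the latest value with the previous one.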

Step2Dev includes deployment metrics instrumentation in the workflow it generates.

👉 step2dev.com

Top comments (1)

klement Gunndu

The MTTR tracking via Lambda alarm state changes is a clean approach I haven't seen before. Most teams I've worked with track MTTR manually in spreadsheets -- this automates the painful part.