Sven Schuchardt

Originally published at biztechbridge.com

How to Compute Zero Trust Effectiveness: Four Metrics That Survive a Breach

Three hops captures the realistic post-compromise reach inside a typical enterprise environment. If your IAM tooling does not expose a graph, the practical substitute is "count of distinct resources the identity has permission to read or modify within 60 minutes of session start, assuming no MFA step-up triggers."
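If your IAM tooling does expose a graph, the three-hop count is a plain breadth-first search. A minimal sketch in Python, assuming a hypothetical adjacency-dict export of identity-to-resource edges (the graph shape and all node names are illustrative, not a specific IAM API):

```python
from collections import deque

def blast_radius(graph, identity, max_hops=3):
    """Count distinct resources reachable from an identity within max_hops.

    graph: dict mapping node -> iterable of directly reachable nodes
    (identity -> resource edges, resource -> credential edges, etc.).
    Hypothetical structure, not a specific IAM export format.
    """
    seen = {identity}
    frontier = deque([(identity, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # stop expanding past the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(identity)  # count resources, not the identity itself
    return len(seen)

# Toy graph: svc-ci reaches the repo, which exposes a deploy key,
# which reaches prod-db -- three hops. backup-bucket is a fourth hop.
graph = {
    "svc-ci": ["repo"],
    "repo": ["deploy-key"],
    "deploy-key": ["prod-db"],
    "prod-db": ["backup-bucket"],
}
print(blast_radius(graph, "svc-ci"))  # 3
```

Run it per identity and bucket the results by identity class; the thresholds below apply per bucket, not to the org-wide pool.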

What good looks like

  • Privileged human identity: under 50 reachable resources, zero crown-jewel data classes without step-up
  • Standard human identity: under 200 reachable resources, no production data without explicit grant
  • Service account: scoped to a single namespace or workload — under 10 reachable resources is normal, over 100 is a problem

Report this metric per identity class, not as a single org-wide average. The average hides the outliers, and the outliers are what get exploited.
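To make the outlier problem concrete, here is a sketch of the per-class report versus the org-wide average, using made-up reachable-resource counts (the class names and numbers are illustrative):

```python
from statistics import median

# Hypothetical per-identity reachable-resource counts, keyed by class.
reach = {
    "privileged_human": [12, 30, 44, 610],   # one over-scoped admin
    "standard_human":   [80, 110, 150, 190],
    "service_account":  [3, 4, 6, 9, 240],   # one over-scoped account
}

org_average = sum(sum(v) for v in reach.values()) / sum(len(v) for v in reach.values())

for cls, counts in reach.items():
    print(f"{cls}: median={median(counts)}, worst={max(counts)}")

# The single average looks tame (~114) while hiding the 610-resource
# admin and the 240-resource service account -- the two entries an
# attacker would actually target.
print(f"org-wide average: {org_average:.0f}")
```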

Metric 2: Lateral-movement time-to-detect

Lateral-movement TTD is the median time between an attacker's first action on a compromised host and the moment your SOC opens a case for the second host. Every Zero Trust programme implicitly claims to reduce this number. Most never measure it.

How to compute it

The easiest source is your EDR plus your SIEM. You need two timestamps per simulated or real lateral-movement event:

// Microsoft Sentinel / KQL — adapt to Splunk / Elastic / Chronicle
// First- and second-hop alerts are correlated on the account entity;
// the Entities[0] extraction is a placeholder -- pick the entity that
// carries the account in your alert schema.
let lateralEvents = SecurityAlert
  | where AlertName has_any ("Pass-the-hash", "Suspicious WMI", "RDP from unusual host", "Service account used from new asset")
  | extend account = tostring(todynamic(Entities)[0].Name)
  | project firstHopTime = TimeGenerated, firstHost = CompromisedEntity, account;
let secondHopAlerts = SecurityAlert
  | where AlertName has_any ("Suspicious lateral connection", "Credential reuse on new host")
  | extend account = tostring(todynamic(Entities)[0].Name)
  | project secondHopTime = TimeGenerated, secondHost = CompromisedEntity, account;
lateralEvents
  | join kind=inner (secondHopAlerts) on account
  | where secondHopTime between (firstHopTime .. firstHopTime + 4h)
  | extend ttd_minutes = datetime_diff('minute', secondHopTime, firstHopTime)
  | summarize p50 = percentile(ttd_minutes, 50), p90 = percentile(ttd_minutes, 90)

If you are not running purple-team exercises that produce real lateral-movement signal, your TTD is technically infinite — and that is the metric you should report. Quarterly attack simulations are the cheapest way to populate this number honestly.
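When the timestamp pairs come out of a purple-team exercise log rather than a SIEM join, the percentile arithmetic is the same anywhere. A tool-agnostic sketch in Python — the timestamps and the nearest-rank percentile helper are illustrative, not a standard:

```python
from datetime import datetime

# Hypothetical (first_hop, second_hop) timestamp pairs, e.g. exported
# from a quarterly attack-simulation log.
pairs = [
    ("2025-06-01T09:00", "2025-06-01T09:07"),
    ("2025-06-01T11:30", "2025-06-01T12:02"),
    ("2025-06-02T14:00", "2025-06-02T14:55"),
    ("2025-06-03T10:15", "2025-06-03T10:21"),
]

fmt = "%Y-%m-%dT%H:%M"
ttd = sorted(
    (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() / 60
    for a, b in pairs
)

def pct(sorted_vals, q):
    # nearest-rank percentile on an already-sorted list
    idx = min(len(sorted_vals) - 1, int(q * len(sorted_vals)))
    return sorted_vals[idx]

print(f"p50={pct(ttd, 0.5):.0f} min, p90={pct(ttd, 0.9):.0f} min")
```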

What good looks like

  • Mature programme: p50 under 10 minutes, p90 under 30 minutes
  • Functional programme: p50 under 60 minutes, p90 under 4 hours
  • Untested programme: unknown — and "unknown" is a board-grade red flag

The IBM 2025 Cost of a Data Breach Report shows breaches contained in under 200 days cost $1.14M less on average than slower ones. Lateral-movement TTD is the leading indicator that determines containment time.

Metric 3: Service-account scope drift

Human identities have managers, review cycles, and offboarding. Service accounts and machine identities have none of these by default — and they outnumber human identities roughly 82 to 1 in a typical enterprise. Scope drift measures how their permissions change quarter over quarter without explicit human approval.

How to compute it

-- Compare snapshot of service-account permissions across two points in time
WITH current_perms AS (
  SELECT identity_id, permission, granted_at
  FROM iam_permissions_snapshot
  WHERE snapshot_date = CURRENT_DATE
    AND identity_type = 'service_account'
),
baseline_perms AS (
  SELECT identity_id, permission
  FROM iam_permissions_snapshot
  WHERE snapshot_date = CURRENT_DATE - INTERVAL '90 days'
    AND identity_type = 'service_account'
),
drift AS (
  SELECT
    c.identity_id,
    c.permission,
    c.granted_at,
    CASE
      WHEN EXISTS (SELECT 1 FROM change_approvals a
                   WHERE a.identity_id = c.identity_id
                     AND a.permission = c.permission
                     AND a.approved_at BETWEEN c.granted_at - INTERVAL '7 days'
                                            AND c.granted_at)
      THEN 'approved'
      ELSE 'unapproved'
    END AS approval_status
  FROM current_perms c
  LEFT JOIN baseline_perms b
    ON c.identity_id = b.identity_id AND c.permission = b.permission
  WHERE b.permission IS NULL  -- new permission since baseline
)
SELECT approval_status, COUNT(*) AS new_perms
FROM drift
GROUP BY approval_status;

The number you report is the count of unapproved new permissions per quarter, plus the top ten service accounts that gained the most scope.
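The same drift computation works on any pair of permission snapshots, not just a SQL warehouse. A minimal Python sketch with made-up snapshot data — identity names, permission strings, and the approvals set are all hypothetical:

```python
# Hypothetical snapshots: {identity: set of permissions}, 90 days apart.
baseline = {
    "svc-build": {"repo:read"},
    "svc-report": {"db:read"},
}
current = {
    "svc-build": {"repo:read", "repo:write", "secrets:read"},
    "svc-report": {"db:read"},
}
# Permission grants that trace to a change ticket.
approved = {("svc-build", "repo:write")}

drift = []
for identity, perms in current.items():
    # Only permissions that did not exist at the baseline count as drift.
    for p in perms - baseline.get(identity, set()):
        status = "approved" if (identity, p) in approved else "unapproved"
        drift.append((identity, p, status))

unapproved = [d for d in drift if d[2] == "unapproved"]
print(f"new permissions: {len(drift)}, unapproved: {len(unapproved)}")
```

Here `svc-build` gained two permissions since the baseline; only one traces to an approval, so the reported unapproved drift is one grant.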

What good looks like

  • Quarterly unapproved drift: under 5% of total permission changes
  • Zero service accounts in the top-ten that touch crown-jewel data classes
  • Every "approved" entry traces to a ticket or change record

Anything above 15% unapproved drift means your IAM hygiene has decayed, regardless of how many controls you have deployed.

Metric 4: Exception age

Every Zero Trust programme accumulates exceptions: the legacy app that cannot do MFA, the build server that needs a static credential, the compliance carve-out for a specific business unit. These are unavoidable. What is not unavoidable is letting them age.

Exception age is the median number of days an active policy exception has been in production.

How to compute it

The exception register is your source of truth. It needs three fields per entry: opened date, business owner, and committed remediation date. The query is trivial:

SELECT
  exception_category,
  COUNT(*) AS active_count,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY DATE_PART('day', NOW() - opened_at)) AS p50_age_days,
  PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY DATE_PART('day', NOW() - opened_at)) AS p90_age_days,
  COUNT(*) FILTER (WHERE remediation_committed_at < NOW()) AS overdue_count
FROM zt_exceptions
WHERE status = 'active'
GROUP BY exception_category
ORDER BY p90_age_days DESC;

If you do not have an exception register, that is the metric you should report: "number of policy exceptions tracked: zero — and we know that is wrong."

What good looks like

  • Median exception age: under 90 days
  • p90 exception age: under 180 days
  • Overdue (past committed remediation date): zero
  • Every entry has a named human owner, not a team distribution list

The most uncomfortable version of this metric is the expired exception count — exceptions whose stated business justification is no longer true but which remain in production because nobody owns the cleanup. Surface that number deliberately.
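A sketch of how that surfacing might look, assuming a hypothetical `justification_still_valid` flag set during quarterly re-review — it is not a field most registers carry today, which is part of the point:

```python
from datetime import date

# Hypothetical exception-register rows; only the flagged entries matter.
exceptions = [
    {"id": "EX-101", "opened": date(2024, 1, 10), "owner": "a.kumar",
     "justification_still_valid": False, "status": "active"},
    {"id": "EX-155", "opened": date(2025, 3, 2), "owner": "j.ortiz",
     "justification_still_valid": True, "status": "active"},
]

# Expired = still active in production, but the stated business
# justification no longer holds.
expired = [e for e in exceptions
           if e["status"] == "active" and not e["justification_still_valid"]]

print(f"expired exceptions still in production: {len(expired)}")
for e in expired:
    print(f"  {e['id']} opened {e['opened']} owner {e['owner']}")
```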

Putting the four metrics together

The four metrics tell a coherent story when reported together:

  • Blast radius high, TTD low: detection is fast but identity scope is too broad. Tighten least-privilege.
  • Blast radius low, TTD high: containment is structurally sound but observability is weak. Invest in EDR + UEBA.
  • Drift high, exception age low: new permissions outpace cleanup. Tighten IAM change control.
  • Drift low, exception age high: stable IAM, but the exception register is a parking lot. Force re-justification quarterly.
  • All four red: the programme is doing activity work. Stop deploying and start measuring.

Notice that none of these four metrics is a coverage percentage. None of them goes up just because you bought a tool. Every one of them requires a human to decide whether the current number is acceptable — which is the entire point.

What to put on the board slide

Translate the four metrics into the only sentence the board cares about:

"If an attacker compromises one identity tomorrow, the blast radius is N systems containing C crown-jewel data classes, our median time to detect a second hop is T minutes, and we currently carry E policy exceptions with a median age of A days."
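Since each variable in that sentence is the output of one of the four computations above, the slide text can be generated rather than hand-written. A trivial sketch with placeholder values:

```python
# Hypothetical current values, one per metric computation.
metrics = {
    "blast_radius": 47,
    "crown_jewel_classes": 2,
    "ttd_p50_min": 22,
    "exception_count": 31,
    "exception_p50_age_days": 140,
}

sentence = (
    "If an attacker compromises one identity tomorrow, the blast radius is "
    f"{metrics['blast_radius']} systems containing {metrics['crown_jewel_classes']} "
    f"crown-jewel data classes, our median time to detect a second hop is "
    f"{metrics['ttd_p50_min']} minutes, and we currently carry "
    f"{metrics['exception_count']} policy exceptions with a median age of "
    f"{metrics['exception_p50_age_days']} days."
)
print(sentence)
```

Regenerating the sentence from live numbers each quarter keeps the board slide honest by construction.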

That single sentence is the dashboard. Everything else — the rings, the percentages, the heatmaps — is supporting evidence. If you cannot answer it from your current tooling in under five minutes, the gap is not a tooling gap. It is a measurement-discipline gap, and no amount of additional Zero Trust deployment will close it.

Closing

Zero Trust is a security discipline that lives or dies by what you measure. Activity metrics make the programme look healthy in year one and vanish in year two when the breach happens anyway. Effectiveness metrics are uglier, harder to compute, and they survive contact with reality.

Pick the four. Compute them honestly. Report the awkward numbers alongside the impressive ones. The CISOs getting real budget in 2026 are the ones whose dashboards make leadership uncomfortable on purpose — because uncomfortable numbers are the only ones a board can act on.


Originally published at biztechbridge.com. For the strategic framing of these metrics in board reporting, see Measuring Zero Trust: The Dashboard Your Board Wants to See.
