Mykola Kondratiuk

Posted on Jun 26

I Built a 5-Metric Dashboard for AI's Impact on My Team - Here's the One I'd Refuse to Skip

#discuss #ai #leadership #productivity

A US state shipped an observability tool for AI yesterday and most engineers didn't notice.

California turned on a monthly tracker (June 25) that links AI-exposure data to unemployment claims, so the state can watch AI's effect on the workforce in near real time. It's a macro dashboard for a thing nobody was instrumenting before.

You already know where I'm going with this. You instrument your services. Latency, error rate, saturation, the whole golden-signals stack. You'd never run a system in prod with no dashboard. So why is the AI workforce inside your own org running completely uninstrumented?

The metrics we actually have are the wrong ones

Pull up how most teams "measure AI" today. Two dashboards, tops.

One is spend. Tokens, seats, API bills. Useful for finance, tells you nothing about impact.

The other is model evals. Benchmark scores, pass rates on a test set. Useful for the model, tells you nothing about your team.

Neither one answers the question an exec is about to start asking: what is this actually doing to how the work gets done? That's not a cost question and it's not a benchmark question. It's an observability question, and it's ours to own before it gets handed to someone who'll measure it badly.

So I sketched the dashboard I'd build. Five signals, one screen. Here's the shape:

# ai-workforce-dashboard.yaml
metrics:
  agent_task_coverage:
    desc: "% of a real workflow's steps running on agents"
    query: agent_steps / total_steps
    note: "measure the Tuesday reality, not the demo"

  role_reclassification_rate:
    desc: "roles that changed shape vs roles eliminated"
    query: reclassified / (reclassified + eliminated)
    note: "the augmentation signal — counts shift, not loss"

  upskilling_completion:
    desc: "% of changed-role people who finished training"
    query: completed / affected

  error_escalation_rate:
    desc: "agent outputs caught + kicked back to a human"
    query: escalated / agent_outputs
    note: "your reliability trend — rising is trouble"

  output_vs_headcount_delta:
    desc: "is output growing faster than the team?"
    query: d(output) - d(headcount)
    refuse_to_skip: true

The two metrics a dev will actually trust

error_escalation_rate is the one that should feel familiar. It's just your error budget, pointed at agents instead of services. An agent ships output, a human catches a bad one and escalates, you log it. The ratio over time is a trust curve. Falling means you can safely hand the agent more. Rising means something regressed and you're about to find out in production.

The honest part: this one is hard to instrument cleanly, because "a human caught it" isn't always a logged event. The first version is manual and noisy. I'd still rather have a rough trend line than pretend reliability is fine because no one filed a ticket.

output_vs_headcount_delta is the one I'd refuse to skip. It's the only line on the screen that ties everything to economics in one number. Output growing faster than the team is the entire argument for the budget you're spending, and it's the number a non-technical board will actually read.

Why reclassification is the metric that keeps it honest

The peg here is a job-displacement tracker, and the lazy version of this article is "here's how to watch the layoffs arrive." I think that frame is wrong, and role_reclassification_rate is why.

When you measure roles that changed shape instead of only roles that vanished, you catch what the doom story misses. The reviewer who now supervises an agent and owns the judgment calls didn't get deleted. Their job moved up the stack. If your dashboard only counts headcount, every shift looks like loss. If it counts reclassification, you can see the work redistributing and actually manage it.

That's the difference between an instrument and an alarm. I want the instrument.

You already own observability. Extend it.

None of this needs a new platform. It needs the discipline you already apply to your services, pointed one layer up at how agents and people are splitting the work. Five queries, one board, checked weekly.

If you had to ship this dashboard with three metrics instead of five, which two would you cut, and which one would you defend in the room?

Top comments (16)

FastAnchor_io • Jun 29

You can't skip the AI's planning because it is extracting data rather than integrating it. A breakpoint in your process will only make you seem unprofessional and increase the cost.

Mykola Kondratiuk • Jun 29

not sure i follow the breakpoint framing - the metrics in the article are post-process, not inline instrumentation. the planning phase runs uninterrupted. what setup are you seeing that adds overhead here?

FastAnchor_io • Jun 30

Simply put, the work do we is relatively flexible, while the planning of AI is more rule - based. We can define your dashboard as a five - step process but, AI will define it as a six - to seven - step process. Unless you keep adjusting it to five steps, during execution, AI still takes more steps. It's just that it's artificially defined as five steps instead of its own planning.

Mykola Kondratiuk • Jun 30

that gap between the five steps you report and the six-seven the model runs is the instrumentation problem. you're measuring the declared process, not the actual one.

FastAnchor_io • Jun 30

Maybe. this Perhaps is the difference in the execution process of AI in China and foreign countries. However, I think it may mainly involve the issue of prompt words.

Mykola Kondratiuk • Jun 30

yeah, the prompt defines what the model even counts as a step - so prompt structure and the measurement gap are the same problem from different angles.

FastAnchor_io • Jun 30

Therefore, I feel that when we define things, we control them in the form of large-scale planning, small-scale decomposition, and automatic completion. However, the problem of token consumption will arise.

Mykola Kondratiuk • Jun 30

token cost is what ends up driving decomposition granularity more than the task structure. you design these planning hierarchies and then context overhead is eating 2-3x the actual work tokens. so you decompose around cost thresholds instead of logical units. the two collide and the economics usually win.

FastAnchor_io • Jul 1

You are right, which is why we need to control it now when designing, so this is also the point I need to pay attention to.

Mykola Kondratiuk • Jul 1

designing around it upfront is really the only clean window - once the hierarchy is running, the cost is already baked in.

FastAnchor_io • Jul 1

Yes, I agree with the statement. That's why I've been emphasizing this point repeatedly.

Mykola Kondratiuk • Jul 1

the compounding part is the sneaky one — every agent added after the hierarchy solidifies makes retrofitting 2x harder, not linearly. upfront design buys an unlock; late design buys a refactor.

FastAnchor_io • Jul 2

Your response is correct. My current understanding is the same. By out sorting the hierarchical structure, the original rules are solidified.

Mykola Kondratiuk • Jul 2

yeah, the solidification problem is basically inevitable - you end up with implicit hierarchy even when nothing in the spec says so.

TuanPK Builds • Jun 26

One metric I'd add is human intervention. If AI-generated work still requires extensive review or rework, the productivity gains may be smaller than they appear. Measuring both speed and quality gives a much clearer picture.

Mykola Kondratiuk • Jun 28

human intervention is the one I found hardest to define consistently - what counts? quick skim-and-approve? flagged for major rework? my team tracked it differently week to week and the trend was noise. rework rate gave us cleaner signal.

View full discussion (16 comments)