Mykola Kondratiuk

I Pulled 3 Months of Engineering Metrics on Our AI Tools - Here's the Dashboard Cell Nobody Built

Last week our team got the same question I bet half of you got: "what's the ROI on the AI tools we adopted in Q1?"

The CFO asked engineering. Engineering pointed at the PM retro. The PM retro had a row that said "team velocity feels higher" and a row that said "developers report subjective time savings." That was the data.

Meanwhile a fresh enterprise survey out of ExcelMindCyber says 73% of companies will fail to deliver promised ROI on AI investments this year. I read that and thought: of course. The dashboard for the question doesn't exist.

What the Repo Already Knows

Pull request throughput. Time-to-first-review. Cycle time from open to merge. Build duration. CI flake rate. Incident count. Mean time to recovery. PR size distribution.

We ship all of these. Most teams stream them into a Grafana board or a Linear analytics view or a custom dbt model on top of GitHub events. The data is in the repo. The data is in the CI logs. The data is in the deploy pipeline.
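Most of these metrics reduce to simple arithmetic over event timestamps. A minimal sketch of one of them, PR cycle time from open to merge; the field names here are hypothetical, not the GitHub API schema:

```python
from datetime import datetime

# Illustrative PR records; in practice these come from GitHub events.
prs = [
    {"opened_at": "2024-02-01T09:00", "merged_at": "2024-02-01T15:30"},
    {"opened_at": "2024-02-02T10:00", "merged_at": "2024-02-03T10:00"},
]

FMT = "%Y-%m-%dT%H:%M"

def cycle_minutes(pr: dict) -> float:
    """Minutes from PR open to merge."""
    opened = datetime.strptime(pr["opened_at"], FMT)
    merged = datetime.strptime(pr["merged_at"], FMT)
    return (merged - opened).total_seconds() / 60

print([cycle_minutes(p) for p in prs])  # → [390.0, 1440.0]
```

The hard part is never computing this number. It's deciding which rows it gets joined to.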

What we don't ship is the cell that says "for the workflow we adopted Tool X for, what changed."

That sounds trivial. It is not.

The Cell That Doesn't Exist

To fill that cell honestly you need three things in one row:

The workflow boundary. Not the tool boundary. "PR review" is a workflow. "Tool X" is a tool. The same tool can land in three workflows and change two of them. You need the join key to be the workflow.

The before-window. A baseline of the metric for that workflow before the tool landed. Not the team-wide cycle time. The cycle time on the specific class of work the tool was supposed to change.

The behavior signal. Did engineers actually use the tool inside the workflow, or did they sign up, click around once, and route around it? We have user-event telemetry for our own product. We rarely have it for the AI tool we just bought.

Without those three columns, the dashboard answers a different question. It answers "did we deploy the tool" not "did the workflow change."
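Those three things plus the outcome make one row. A minimal sketch of what that row looks like, with every name hypothetical:

```python
from dataclasses import dataclass

@dataclass
class WorkflowToolRow:
    workflow_id: str            # workflow boundary, not tool boundary
    tool_id: str
    developer_id: str
    baseline_cycle_min: float   # before-window: metric before the tool landed
    current_cycle_min: float    # same metric, same workflow, after
    tool_uses_30d: int          # behavior signal: actual use, not signup

    def delta_pct(self) -> float:
        """Percent change in cycle time; negative means faster."""
        return 100 * (self.current_cycle_min - self.baseline_cycle_min) / self.baseline_cycle_min

row = WorkflowToolRow("pr_review", "tool_x", "dev_42", 180.0, 140.0, 25)
print(round(row.delta_pct(), 1))  # → -22.2
```

Drop any one of the three input columns and `delta_pct` starts answering a different question.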

What I Tried First (and Why It Failed)

The first version of the cell I built was a simple comparison: cycle time on PRs in February versus cycle time on PRs in April. The tool landed mid-March.

Numbers looked good. Cycle time was down 14%. I almost shipped it.

Then I segmented by PR class. Refactor PRs were down 22%. Bug-fix PRs were flat. Feature PRs were up 4%. The aggregate hid three completely different stories.

Then I looked at tool usage. Half the team had opened the tool fewer than three times in 30 days. The 14% improvement was carried by four developers. The rest of the team was running the same workflow without the tool and getting roughly the same numbers.
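The trap is easy to reproduce. A sketch with made-up numbers (not our real data) showing how per-class deltas wash out into one aggregate figure:

```python
# class -> (avg cycle time before, avg cycle time after, PR count)
# Numbers are illustrative only.
segments = {
    "refactor": (200.0, 156.0, 40),
    "bugfix":   (120.0, 120.0, 30),
    "feature":  (300.0, 312.0, 30),
}

def pct(before: float, after: float) -> float:
    """Percent change; negative means faster."""
    return 100 * (after - before) / before

for name, (before, after, _) in segments.items():
    print(f"{name}: {pct(before, after):+.0f}%")

# Count-weighted aggregate: one number, three different stories behind it.
before_total = sum(b * n for b, _, n in segments.values())
after_total = sum(a * n for _, a, n in segments.values())
print(f"aggregate: {pct(before_total, after_total):+.1f}%")
```

The aggregate moves, so the dashboard looks green. Only the segmentation shows who actually got faster.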

The honest answer to the CFO question wasn't "the tool drove a 14% improvement." It was "four developers got real value, the rest haven't adopted it yet, and we don't have the playbook for the rest."

If I had shipped the v1 number, the next quarter's budget cycle would have used it as proof. Then we would have spent more on the same shape of tool, and gotten a smaller delta, because the developers who would benefit had already adopted.

What Engineering Owns Here

I keep hearing the framing that "this is a PM problem." It isn't, or rather, it isn't only.

The PM retro happens after the quarter. The dashboard cell happens continuously. If engineering owns the metrics that say "this tool changed this workflow for these developers by this much," the PM gets a starting point that isn't fiction. If engineering owns nothing, the PM writes the retro on vibes and the CFO funds the next round on vibes.

The tools we already use give us most of what we need. GitHub events. Linear events. Tool-specific webhooks where they exist. A small dbt model that defines workflow boundaries explicitly. A heartbeat metric on tool usage at the user level.
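The workflow-boundary definition is the piece worth making explicit and reviewable, the way a small dbt model would. A hypothetical sketch of that mapping; the sources and event types here are illustrative, not anyone's real schema:

```python
from typing import Optional

# Explicit workflow boundaries: which raw events belong to which workflow.
# This map is the join key; if it lives only in someone's head, the cell can't exist.
WORKFLOW_BOUNDARIES = {
    "pr_review":    {"source": "github", "types": {"pr_opened", "review_submitted", "pr_merged"}},
    "issue_triage": {"source": "linear", "types": {"issue_created", "issue_assigned"}},
}

def workflow_for(source: str, event_type: str) -> Optional[str]:
    """Map a raw event to a workflow, or None if it crosses no boundary."""
    for workflow, spec in WORKFLOW_BOUNDARIES.items():
        if spec["source"] == source and event_type in spec["types"]:
            return workflow
    return None

print(workflow_for("github", "review_submitted"))  # → pr_review
```

Events that map to None are fine; they just don't belong in any row of the cell.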

The piece nobody is building is the join. Workflow x tool x usage x outcome. Four columns. Most teams have one.

The Smallest Version Worth Shipping

A single materialized view. Per workflow, per AI tool, per developer:

select
  w.workflow_id,
  t.tool_id,
  w.developer_id,
  date_trunc('week', w.event_at) as week,
  -- usage signal
  count(distinct case when t.event_type = 'tool_invocation' then t.event_id end) as tool_uses,
  -- workflow outcome
  avg(w.cycle_time_minutes) as cycle_time_avg,
  count(distinct case when w.event_type = 'workflow_completion' then w.event_id end) as completions
from workflow_events w
-- left join keeps developers with zero tool use, so the "zero" bucket exists
left join tool_events t
  on t.developer_id = w.developer_id
 and t.workflow_id = w.workflow_id
 and date_trunc('week', t.event_at) = date_trunc('week', w.event_at)
group by 1, 2, 3, 4;

One view, one join. Then a Grafana panel that shows cycle_time_avg split by tool_uses bucket (zero, low, high). The panel answers the question: for the developers who actually use the tool, did the workflow get faster, and by how much?
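The bucket split itself is trivial once the view exists. A sketch over invented rows of (developer, weekly tool_uses, cycle_time_avg); the bucket thresholds are assumptions, tune them to your data:

```python
from statistics import mean

# Hypothetical rows pulled from the materialized view.
rows = [
    ("d1", 0, 210), ("d2", 1, 205), ("d3", 2, 198),
    ("d4", 12, 150), ("d5", 18, 140), ("d6", 0, 215),
]

def bucket(uses: int) -> str:
    """Assumed cutoffs: 0 = zero, 1-4 = low, 5+ = high."""
    if uses == 0:
        return "zero"
    return "low" if uses < 5 else "high"

by_bucket: dict = {}
for _, uses, cycle_time in rows:
    by_bucket.setdefault(bucket(uses), []).append(cycle_time)

for b in ("zero", "low", "high"):
    print(b, round(mean(by_bucket[b]), 1))  # → zero 212.5 / low 201.5 / high 145.0
```

If the "zero" bucket and the "high" bucket show the same cycle time, the tool didn't change the workflow, no matter what the aggregate says.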

The first time I ran ours, the bucket comparison was the most honest 30 seconds of the quarter. It told me which tools had earned their seat and which were budget items pretending to be productivity gains.

Honest Limit

This dashboard cell does not answer whether the tool was worth the money. That requires a price tag, a discount rate, an opportunity-cost guess. That part is genuinely a CFO conversation.

What the cell does answer is whether the workflow changed at all. Without that, the CFO conversation is fiction. With it, the conversation is at least a real conversation.

What's the cell your team has built and your CFO doesn't know about yet?

Top comments (1)

Mykola Kondratiuk

honestly the materialized-view sketch breaks the moment your AI tool doesn't emit usage webhooks. half the IDE-embedded tools we tried in Q1 had zero per-developer telemetry, and the proxy signals (cmd-K invocations, file-edit deltas) were noisy enough that the bucket comparison stopped meaning anything. the cell is real but the join key isn't free for every tool yet.