Mykola Kondratiuk

I Pulled 3 Months of Engineering Metrics on Our AI Tools - Here's the Dashboard Cell Nobody Built

Last week our team got the same question I bet half of you got: "what's the ROI on the AI tools we adopted in Q1?"

The CFO asked engineering. Engineering pointed at the PM retro. The PM retro had a row that said "team velocity feels higher" and a row that said "developers report subjective time savings." That was the data.

Meanwhile a fresh enterprise survey out of ExcelMindCyber says 73% of companies will fail to deliver promised ROI on AI investments this year. I read that and thought: of course. The dashboard for the question doesn't exist.

What the Repo Already Knows

Pull request throughput. Time-to-first-review. Cycle time from open to merge. Build duration. CI flake rate. Incident count. Mean time to recovery. PR size distribution.

We ship all of these. Most teams stream them into a Grafana board or a Linear analytics view or a custom dbt model on top of GitHub events. The data is in the repo. The data is in the CI logs. The data is in the deploy pipeline.
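Most of these metrics reduce to simple arithmetic over event timestamps. A minimal sketch of one of them, PR cycle time from open to merge; the field names here are hypothetical, not the GitHub API schema:

```python
from datetime import datetime

# Illustrative PR records; in practice these come from GitHub events.
prs = [
    {"opened_at": "2024-02-01T09:00", "merged_at": "2024-02-01T15:30"},
    {"opened_at": "2024-02-02T10:00", "merged_at": "2024-02-03T10:00"},
]

FMT = "%Y-%m-%dT%H:%M"

def cycle_minutes(pr: dict) -> float:
    """Minutes from PR open to merge."""
    opened = datetime.strptime(pr["opened_at"], FMT)
    merged = datetime.strptime(pr["merged_at"], FMT)
    return (merged - opened).total_seconds() / 60

print([cycle_minutes(p) for p in prs])  # → [390.0, 1440.0]
```

The hard part is never computing this number. It's deciding which rows it gets joined to.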

What we don't ship is the cell that says "for the workflow we adopted Tool X for, what changed."

That sounds trivial. It is not.

The Cell That Doesn't Exist

To fill that cell honestly you need three things in one row:

The workflow boundary. Not the tool boundary. "PR review" is a workflow. "Tool X" is a tool. The same tool can land in three workflows and change two of them. You need the join key to be the workflow.

The before-window. A baseline of the metric for that workflow before the tool landed. Not the team-wide cycle time. The cycle time on the specific class of work the tool was supposed to change.

The behavior signal. Did engineers actually use the tool inside the workflow, or did they sign up, click around once, and route around it? We have user-event telemetry for our own product. We rarely have it for the AI tool we just bought.

Without those three columns, the dashboard answers a different question. It answers "did we deploy the tool" not "did the workflow change."
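Those three things plus the outcome make one row. A minimal sketch of what that row looks like, with every name hypothetical:

```python
from dataclasses import dataclass

@dataclass
class WorkflowToolRow:
    workflow_id: str            # workflow boundary, not tool boundary
    tool_id: str
    developer_id: str
    baseline_cycle_min: float   # before-window: metric before the tool landed
    current_cycle_min: float    # same metric, same workflow, after
    tool_uses_30d: int          # behavior signal: actual use, not signup

    def delta_pct(self) -> float:
        """Percent change in cycle time; negative means faster."""
        return 100 * (self.current_cycle_min - self.baseline_cycle_min) / self.baseline_cycle_min

row = WorkflowToolRow("pr_review", "tool_x", "dev_42", 180.0, 140.0, 25)
print(round(row.delta_pct(), 1))  # → -22.2
```

Drop any one of the three input columns and `delta_pct` starts answering a different question.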

What I Tried First (and Why It Failed)

The first version of the cell I built was a simple comparison: cycle time on PRs in February versus cycle time on PRs in April. The tool landed mid-March.

Numbers looked good. Cycle time was down 14%. I almost shipped it.

Then I segmented by PR class. Refactor PRs were down 22%. Bug-fix PRs were flat. Feature PRs were up 4%. The aggregate hid three completely different stories.

Then I looked at tool usage. Half the team had opened the tool fewer than three times in 30 days. The 14% improvement was carried by four developers. The rest of the team was running the same workflow without the tool and getting roughly the same numbers.
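The trap is easy to reproduce. A sketch with made-up numbers (not our real data) showing how per-class deltas wash out into one aggregate figure:

```python
# class -> (avg cycle time before, avg cycle time after, PR count)
# Numbers are illustrative only.
segments = {
    "refactor": (200.0, 156.0, 40),
    "bugfix":   (120.0, 120.0, 30),
    "feature":  (300.0, 312.0, 30),
}

def pct(before: float, after: float) -> float:
    """Percent change; negative means faster."""
    return 100 * (after - before) / before

for name, (before, after, _) in segments.items():
    print(f"{name}: {pct(before, after):+.0f}%")

# Count-weighted aggregate: one number, three different stories behind it.
before_total = sum(b * n for b, _, n in segments.values())
after_total = sum(a * n for _, a, n in segments.values())
print(f"aggregate: {pct(before_total, after_total):+.1f}%")
```

The aggregate moves, so the dashboard looks green. Only the segmentation shows who actually got faster.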

The honest answer to the CFO question wasn't "the tool drove a 14% improvement." It was "four developers got real value, the rest haven't adopted it yet, and we don't have the playbook for the rest."

If I had shipped the v1 number, the next quarter's budget cycle would have used it as proof. Then we would have spent more on the same shape of tool, and gotten a smaller delta, because the developers who would benefit had already adopted.

What Engineering Owns Here

I keep hearing the framing that "this is a PM problem." It isn't, or rather, it isn't only.

The PM retro happens after the quarter. The dashboard cell happens continuously. If engineering owns the metrics that say "this tool changed this workflow for these developers by this much," the PM gets a starting point that isn't fiction. If engineering owns nothing, the PM writes the retro on vibes and the CFO funds the next round on vibes.

The tools we already use give us most of what we need. GitHub events. Linear events. Tool-specific webhooks where they exist. A small dbt model that defines workflow boundaries explicitly. A heartbeat metric on tool usage at the user level.
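The workflow-boundary definition is the piece worth making explicit and reviewable, the way a small dbt model would. A hypothetical sketch of that mapping; the sources and event types here are illustrative, not anyone's real schema:

```python
from typing import Optional

# Explicit workflow boundaries: which raw events belong to which workflow.
# This map is the join key; if it lives only in someone's head, the cell can't exist.
WORKFLOW_BOUNDARIES = {
    "pr_review":    {"source": "github", "types": {"pr_opened", "review_submitted", "pr_merged"}},
    "issue_triage": {"source": "linear", "types": {"issue_created", "issue_assigned"}},
}

def workflow_for(source: str, event_type: str) -> Optional[str]:
    """Map a raw event to a workflow, or None if it crosses no boundary."""
    for workflow, spec in WORKFLOW_BOUNDARIES.items():
        if spec["source"] == source and event_type in spec["types"]:
            return workflow
    return None

print(workflow_for("github", "review_submitted"))  # → pr_review
```

Events that map to None are fine; they just don't belong in any row of the cell.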

The piece nobody is building is the join. Workflow x tool x usage x outcome. Four columns. Most teams have one.

The Smallest Version Worth Shipping

A single materialized view. Per workflow, per AI tool, per developer:

select
  w.workflow_id,
  t.tool_id,
  w.developer_id,
  date_trunc('week', w.event_at) as week,
  -- usage signal
  count(distinct case when t.event_type = 'tool_invocation' then t.event_id end) as tool_uses,
  -- workflow outcome
  avg(w.cycle_time_minutes) as cycle_time_avg,
  count(distinct case when w.event_type = 'workflow_completion' then w.event_id end) as completions
from workflow_events w
-- left join keeps developers with zero tool use, so the "zero" bucket exists
left join tool_events t
  on t.developer_id = w.developer_id
 and t.workflow_id = w.workflow_id
 and date_trunc('week', t.event_at) = date_trunc('week', w.event_at)
group by 1, 2, 3, 4;

One view, one join. Then a Grafana panel that shows cycle_time_avg split by tool_uses bucket (zero, low, high). The panel answers the question: for the developers who actually use the tool, did the workflow get faster, and by how much?
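The bucket split itself is trivial once the view exists. A sketch over invented rows of (developer, weekly tool_uses, cycle_time_avg); the bucket thresholds are assumptions, tune them to your data:

```python
from statistics import mean

# Hypothetical rows pulled from the materialized view.
rows = [
    ("d1", 0, 210), ("d2", 1, 205), ("d3", 2, 198),
    ("d4", 12, 150), ("d5", 18, 140), ("d6", 0, 215),
]

def bucket(uses: int) -> str:
    """Assumed cutoffs: 0 = zero, 1-4 = low, 5+ = high."""
    if uses == 0:
        return "zero"
    return "low" if uses < 5 else "high"

by_bucket: dict = {}
for _, uses, cycle_time in rows:
    by_bucket.setdefault(bucket(uses), []).append(cycle_time)

for b in ("zero", "low", "high"):
    print(b, round(mean(by_bucket[b]), 1))  # → zero 212.5 / low 201.5 / high 145.0
```

If the "zero" bucket and the "high" bucket show the same cycle time, the tool didn't change the workflow, no matter what the aggregate says.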

The first time I ran ours, the bucket comparison was the most honest 30 seconds of the quarter. It told me which tools had earned their seat and which were budget items pretending to be productivity gains.

Honest Limit

This dashboard cell does not answer whether the tool was worth the money. That requires a price tag, a discount rate, an opportunity-cost guess. That part is genuinely a CFO conversation.

What the cell does answer is whether the workflow changed at all. Without that, the CFO conversation is fiction. With it, the conversation is at least a real conversation.

What's the cell your team has built and your CFO doesn't know about yet?

Top comments (1)

Mykola Kondratiuk

honestly the materialized-view sketch breaks the moment your AI tool doesn't emit usage webhooks. half the IDE-embedded tools we tried in Q1 had zero per-developer telemetry, and the proxy signals (cmd-K invocations, file-edit deltas) were noisy enough that the bucket comparison stopped meaning anything. the cell is real but the join key isn't free for every tool yet.