<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Meenakshi Sundaram. C</title>
    <description>The latest articles on DEV Community by Meenakshi Sundaram. C (@meenakshisundaramms).</description>
    <link>https://dev.to/meenakshisundaramms</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F580331%2F6836ce95-1b14-4bdd-b246-07ba41db2e97.png</url>
      <title>DEV Community: Meenakshi Sundaram. C</title>
      <link>https://dev.to/meenakshisundaramms</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/meenakshisundaramms"/>
    <language>en</language>
    <item>
      <title>The AI Industry Is Measuring the Wrong Thing. Here Are the 6 Metrics That Actually Matter.</title>
      <dc:creator>Meenakshi Sundaram. C</dc:creator>
      <pubDate>Fri, 24 Apr 2026 06:13:26 +0000</pubDate>
      <link>https://dev.to/meenakshisundaramms/the-ai-industry-is-measuring-the-wrong-thing-here-are-the-6-metrics-that-actually-matter-343m</link>
      <guid>https://dev.to/meenakshisundaramms/the-ai-industry-is-measuring-the-wrong-thing-here-are-the-6-metrics-that-actually-matter-343m</guid>
      <description>&lt;p&gt;Every month, the same scene plays out inside AI product teams around the world.&lt;/p&gt;

&lt;p&gt;The OpenAI invoice arrives. It's higher than expected — sometimes significantly higher. Someone shares their screen. Everyone squints at the same provider dashboard. The same question hangs in the air:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What the hell caused this?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dashboard shows totals. It shows models. It shows requests. It does not show you what you actually need to know: which customer triggered the spike, which agent is bleeding your budget, or whether any of this spending is actually generating a return.&lt;/p&gt;

&lt;p&gt;This is the state of LLM observability in 2026. An entire industry of tools — Helicone, Langfuse, LangSmith, Braintrust, every provider dashboard — measuring the same thing in the same way.&lt;/p&gt;

&lt;p&gt;They measure what you put in. None of them measure what you get out.&lt;/p&gt;

&lt;p&gt;I've been building AI agents professionally for two years. I've watched teams make expensive architectural decisions based on cost dashboards that only told half the story. I've seen companies underprice their AI products by $40–$80 per customer per month — not from lack of effort, but from lack of the right data.&lt;/p&gt;

&lt;p&gt;After researching every LLM observability tool on the market and interviewing teams building production AI products, I'm convinced the industry has a measurement problem. And I want to propose six new metrics to fix it.&lt;/p&gt;

&lt;p&gt;I'm calling the framework &lt;strong&gt;Token Intelligence&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;Why Every LLM Tool Today Is Measuring the Wrong Thing&lt;/h2&gt;

&lt;p&gt;Every LLM observability tool today — without exception — measures &lt;strong&gt;input metrics&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many tokens did I consume?&lt;/li&gt;
&lt;li&gt;What did it cost?&lt;/li&gt;
&lt;li&gt;Which model was slowest?&lt;/li&gt;
&lt;li&gt;How many requests per day?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are consumption metrics. They tell you what you spent. They describe the left side of the equation.&lt;/p&gt;

&lt;p&gt;Nobody is measuring the &lt;strong&gt;right side&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did those tokens produce?&lt;/li&gt;
&lt;li&gt;What revenue did this call contribute to?&lt;/li&gt;
&lt;li&gt;Were the tasks it powered actually completed?&lt;/li&gt;
&lt;li&gt;Is the AI investment generating a return?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think about how absurd this would be in any other context. Imagine running a sales team and only tracking how many calls your reps made — never whether those calls closed deals. Or running a factory and only tracking energy consumption — never whether the machines produced anything.&lt;/p&gt;

&lt;p&gt;That's exactly what we're doing with LLMs.&lt;/p&gt;

&lt;p&gt;We've built an entire observability industry around measuring inputs. Meanwhile, the questions that actually matter to the businesses running these systems go unanswered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which of my customers is actually profitable after LLM costs?&lt;/li&gt;
&lt;li&gt;When I changed this prompt, did it improve my product or just reduce my bill?&lt;/li&gt;
&lt;li&gt;Is my AI automation actually cheaper per completed task, or just cheaper per call?&lt;/li&gt;
&lt;li&gt;What should I charge for my Pro plan?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these questions are answered by token counts or daily spend charts.&lt;/p&gt;

&lt;p&gt;Here are six metrics that do answer them.&lt;/p&gt;




&lt;h2&gt;The 6 Token Intelligence Metrics&lt;/h2&gt;

&lt;h3&gt;1. VPT — Value-per-Token&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; Business Value Generated ÷ Tokens Consumed&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the headline metric. The number the industry is missing.&lt;/p&gt;

&lt;p&gt;Two of your agents can consume identical token counts. One drives $10,000 in monthly customer outcomes. The other drives $200. Their costs are the same. Their VPT is 50x different.&lt;/p&gt;

&lt;p&gt;Without VPT, you'd optimise both agents the same way. You'd try to reduce token consumption because that's what your dashboard tells you to do. With VPT, you'd realise the second agent isn't a cost problem — it's a value problem. It's generating almost no return for what it spends. No amount of prompt trimming fixes that. You need to rethink what it does.&lt;/p&gt;

&lt;p&gt;VPT requires knowing the value side of each call. You can get there three ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Revenue sync&lt;/strong&gt; — connect Stripe. Every token consumed by a paying customer gets attributed to that customer's MRR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outcome tagging&lt;/strong&gt; — developers tag calls with the business value they generate: &lt;code&gt;{ status: 'completed', value_usd: 50.00 }&lt;/code&gt;. Five extra lines of code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern inference&lt;/strong&gt; — detect quality signals from behaviour: retry rates, session abandonment, output anomalies. No tagging required.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;VPT is the metric that turns "how much did we spend on AI?" into "what did our AI spending return?"&lt;/p&gt;
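&lt;p&gt;The arithmetic itself is trivial; the attribution pipeline is the hard part. A minimal sketch of the ratio, using the numbers from the example above (function and variable names are illustrative, not from any real SDK):&lt;/p&gt;

```typescript
// Value-per-Token: business value attributed to a set of LLM calls,
// divided by the tokens those calls consumed.
function valuePerToken(valueUsd: number, tokensConsumed: number): number {
  return valueUsd / tokensConsumed;
}

// Two agents with identical consumption but very different returns:
const vptA = valuePerToken(10_000, 2_000_000); // 0.005, $10k of monthly outcomes
const vptB = valuePerToken(200, 2_000_000);    // 0.0001, $200 of monthly outcomes
console.log(vptA / vptB); // 50x apart on the same token bill
```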




&lt;h3&gt;2. TCC — Task Completion Cost&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; Total LLM Cost for an Agent ÷ Number of Successfully Completed Tasks&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not cost per call. &lt;strong&gt;Cost per successful outcome.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This distinction sounds small. The implications are enormous.&lt;/p&gt;

&lt;p&gt;Consider two invoice-parsing agents. Agent A costs $0.08 per call but fails 40% of the time, requiring an average of 1.7 retries per invoice. Agent B costs $0.14 per call and succeeds on the first attempt 95% of the time.&lt;/p&gt;

&lt;p&gt;Everyone optimising for call cost would choose Agent A. It's cheaper, right?&lt;/p&gt;

&lt;p&gt;Except Agent A's TCC is $0.08 × 1.7 = &lt;strong&gt;$0.136 per completed invoice&lt;/strong&gt;. Agent B's TCC is &lt;strong&gt;$0.147 per completed invoice&lt;/strong&gt; — barely more expensive, with dramatically better reliability.&lt;/p&gt;

&lt;p&gt;Now flip the model choice. Agent A uses GPT-4o-mini (cheap). Agent B uses GPT-4o (expensive). Your dashboard screams at you to use Agent A. But TCC tells you the "expensive" model might actually be cheaper per completed task.&lt;/p&gt;

&lt;p&gt;Cost per call is the wrong denominator. Success is what matters.&lt;/p&gt;
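&lt;p&gt;The calculation behind the invoice-parsing example, as a sketch (names are illustrative):&lt;/p&gt;

```typescript
// Task Completion Cost: total spend divided by successfully completed tasks.
// Equivalently: cost per call times average attempts per completed task.
function taskCompletionCost(costPerCall: number, attemptsPerSuccess: number): number {
  return costPerCall * attemptsPerSuccess;
}

// Agent A: $0.08/call but 1.7 attempts per completed invoice.
// Agent B: $0.14/call with 95% first-attempt success (1/0.95 attempts).
const tccA = taskCompletionCost(0.08, 1.7);      // ≈ 0.136 per completed invoice
const tccB = taskCompletionCost(0.14, 1 / 0.95); // ≈ 0.147 per completed invoice
```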




&lt;h3&gt;3. MAR — Margin At Risk&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; Projected Month-End LLM Cost (at current rate) − Revenue Per Customer&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every cost alert in every LLM tool today is &lt;strong&gt;reactive&lt;/strong&gt;. It fires after you've spent too much. "You crossed $500 today" — thanks, but the damage is done.&lt;/p&gt;

&lt;p&gt;MAR is &lt;strong&gt;predictive&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;MAR looks at a customer's current usage burn rate, extrapolates to month end, and compares it against what that customer pays you. If the trajectory is bad, it fires now — while you still have time to act.&lt;/p&gt;

&lt;p&gt;Instead of: &lt;em&gt;"TechStartup Inc cost you $42 last month."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;MAR says: &lt;em&gt;"At TechStartup Inc's current usage rate, they will cost you $42 by month end. They're on a $29 plan. You have 11 days to intervene."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Those 11 days are the difference between proactively reaching out to move them to a higher tier, implementing per-customer usage caps, or optimising the agent they're hammering — versus discovering you've been losing money on a customer for three months.&lt;/p&gt;
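&lt;p&gt;The projection is simple linear extrapolation. A sketch, using the TechStartup Inc numbers (names are illustrative):&lt;/p&gt;

```typescript
// Margin At Risk: extrapolate a customer's burn rate to month end
// and compare it against what that customer pays.
function marginAtRisk(
  spendSoFarUsd: number,
  dayOfMonth: number,
  daysInMonth: number,
  planPriceUsd: number,
): number {
  const projected = (spendSoFarUsd / dayOfMonth) * daysInMonth;
  return projected - planPriceUsd; // positive means a projected loss
}

// Day 19 of a 30-day month, $26.60 spent so far, on a $29 plan:
// projects $42 at month end, so $13 of margin is at risk, with 11 days to act.
const mar = marginAtRisk(26.6, 19, 30, 29);
```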

&lt;p&gt;Every AI product team charging customers a flat subscription fee needs MAR. The difference between "I found out last month" and "I found out with 11 days to act" compounds dramatically at scale.&lt;/p&gt;




&lt;h3&gt;4. AES — Agent Efficiency Score&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; Weighted composite of completion rate (40%), cost-per-task vs average (30%), retry rate inverse (20%), acceptance rate (10%). Score 0–100.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let me be honest: most LLM dashboards are unreadable for founders and business stakeholders.&lt;/p&gt;

&lt;p&gt;Latency percentiles. Token histograms. Daily request charts. These are useful for engineers debugging a specific issue. They are useless for a CEO trying to understand whether the AI product is working.&lt;/p&gt;

&lt;p&gt;AES solves this by collapsing agent performance into a single number.&lt;/p&gt;

&lt;p&gt;Above 70: healthy. Between 40 and 70: watch closely. Below 40: something is wrong.&lt;/p&gt;

&lt;p&gt;Your invoice-parser agent scores 84. Your document-summariser scores 31. You don't need to understand the underlying model calls to know that one agent is performing well and one needs attention.&lt;/p&gt;

&lt;p&gt;AES is what you put in a board deck. It's what you track in your weekly review. It's the number that tells you, without requiring engineering context, whether your AI is doing its job.&lt;/p&gt;
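&lt;p&gt;Here's one plausible way to compute the composite. The weights come from the formula above; how each component is normalised to a 0–1 range isn't pinned down in the definition, so those choices are assumptions:&lt;/p&gt;

```typescript
// Agent Efficiency Score: a 0–100 weighted composite.
function agentEfficiencyScore(
  completionRate: number,  // fraction of tasks completed, 0–1
  costRatio: number,       // fleet avg cost-per-task / this agent's, capped at 1
  retryRate: number,       // fraction of calls that were retries, 0–1
  acceptanceRate: number,  // fraction of outputs users accepted, 0–1
): number {
  const score =
    0.4 * completionRate +
    0.3 * Math.min(costRatio, 1) +
    0.2 * (1 - retryRate) +
    0.1 * acceptanceRate;
  return Math.round(100 * score);
}

// A healthy agent: 90% completion, average cost, 10% retries, 80% acceptance.
const aes = agentEfficiencyScore(0.9, 1.0, 0.1, 0.8); // 92, above the 70 threshold
```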




&lt;h3&gt;5. PRI — Prompt ROI Index&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; (Outcome Rate v2 / Cost v2) ÷ (Outcome Rate v1 / Cost v1). PRI &amp;gt; 1.0 = v2 has better return.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every team building with LLMs iterates on prompts constantly. The problem is how you evaluate those iterations.&lt;/p&gt;

&lt;p&gt;Most teams evaluate quality &lt;strong&gt;or&lt;/strong&gt; cost, each in isolation. Did the output get better? Did it get cheaper? Asked separately, these are the wrong questions.&lt;/p&gt;

&lt;p&gt;PRI asks: &lt;strong&gt;did the return improve?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A prompt that costs 30% more but drives 80% better task completion has a PRI of 1.38. It's more expensive. It's also clearly better — your product works better, your customers get better outcomes, your VPT goes up. The cost increase is a worthwhile investment.&lt;/p&gt;

&lt;p&gt;A prompt that costs 20% less but drives 40% worse task completion has a PRI of 0.75. You saved money. You also degraded your product. Your customers fail more often. Your TCC goes up. You traded product quality for a slightly lower bill.&lt;/p&gt;

&lt;p&gt;PRI is the only way to evaluate a prompt change that accounts for both dimensions simultaneously. Without it, you're making decisions based on incomplete information every time you ship a new system prompt.&lt;/p&gt;
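&lt;p&gt;The formula in code, reproducing both scenarios above (names are illustrative):&lt;/p&gt;

```typescript
// Prompt ROI Index: did outcomes-per-dollar improve from v1 to v2?
function promptRoiIndex(
  outcomeRateV1: number, costV1: number,
  outcomeRateV2: number, costV2: number,
): number {
  return (outcomeRateV2 / costV2) / (outcomeRateV1 / costV1);
}

// 30% costlier but 80% better completion: a worthwhile trade.
const better = promptRoiIndex(1.0, 1.0, 1.8, 1.3); // ≈ 1.38
// 20% cheaper but 40% worse completion: a false saving.
const worse = promptRoiIndex(1.0, 1.0, 0.6, 0.8);  // ≈ 0.75
```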




&lt;h3&gt;6. CPF — Cost-to-Price Floor&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt; 90th Percentile LLM Cost Per Customer in a Plan Tier ÷ (1 − Target Gross Margin)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the most immediately valuable metric in the framework for any AI company charging customers a subscription fee.&lt;/p&gt;

&lt;p&gt;It answers the question every founder guesses at: &lt;strong&gt;what is the minimum I can charge and still be profitable?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's how it works in practice.&lt;/p&gt;

&lt;p&gt;You look at your Pro plan customers. The median LLM cost to serve one is $8/month. But the 90th percentile — your heaviest users — costs $28/month to serve. You set a target gross margin of 70%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPF = $28 ÷ (1 − 0.70) = $93.33&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're charging $29/month for Pro, you're losing money on your heavy users and barely profitable on everyone else. You have a $64/month gap between what you charge and what you need to charge.&lt;/p&gt;

&lt;p&gt;Across 40 Pro customers, that's $2,560/month in revenue left on the table — or worse, destroyed.&lt;/p&gt;

&lt;p&gt;The average AI startup I've talked to discovers they're undercharging by $40–$80/month per customer when they first calculate CPF. They set their prices from competitor benchmarks and gut feel, not from what their actual heaviest users cost to serve.&lt;/p&gt;

&lt;p&gt;CPF makes that calculation automatic, continuous, and exact.&lt;/p&gt;
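&lt;p&gt;The floor calculation from the Pro-plan example above, as a sketch (names are illustrative):&lt;/p&gt;

```typescript
// Cost-to-Price Floor: the minimum sustainable price for a plan tier,
// anchored on the heavy users (p90 cost to serve), not the median customer.
function costToPriceFloor(p90CostUsd: number, targetGrossMargin: number): number {
  return p90CostUsd / (1 - targetGrossMargin);
}

// Heavy Pro users cost $28/month to serve; target gross margin is 70%.
const floor = costToPriceFloor(28, 0.7); // ≈ 93.33, vs. the $29 you charge
```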




&lt;h2&gt;The Framework as a Whole&lt;/h2&gt;

&lt;p&gt;Here's how the six metrics fit together:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Question it answers&lt;/th&gt;
&lt;th&gt;When you need it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VPT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Is my AI generating a return?&lt;/td&gt;
&lt;td&gt;Always — this is your north star&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TCC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Am I optimising for the right thing?&lt;/td&gt;
&lt;td&gt;When choosing models or debugging agent efficiency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MAR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Which customers will make me unprofitable?&lt;/td&gt;
&lt;td&gt;Before the end of every billing cycle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AES&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Is each agent actually working?&lt;/td&gt;
&lt;td&gt;Weekly business review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PRI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Did my prompt change help or hurt?&lt;/td&gt;
&lt;td&gt;Every time you ship a new prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What should I charge?&lt;/td&gt;
&lt;td&gt;Before any pricing decision&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;They're not independent metrics. They're a system. VPT tells you the headline. AES tells you which agents are underperforming. TCC helps you fix them. PRI guides how you iterate. MAR protects your margin while you're improving. CPF tells you what to charge once you have the data.&lt;/p&gt;




&lt;h2&gt;Why This Hasn't Existed Until Now&lt;/h2&gt;

&lt;p&gt;The honest answer: because measuring the value side of an LLM call is harder than measuring the cost side.&lt;/p&gt;

&lt;p&gt;Cost is automatic. Every provider returns token counts in the API response. You can calculate cost without asking the developer for anything. That's why every tool does it — it's free data.&lt;/p&gt;

&lt;p&gt;Value requires the developer's participation. You need to know what "success" means for a given call. A customer service agent that resolves the ticket is successful. One that sends the user in circles is not. Both consume the same tokens.&lt;/p&gt;

&lt;p&gt;The three-layer approach I described for VPT — revenue sync, outcome tagging, and pattern inference — is how you bridge that gap. Each layer requires progressively more investment and returns progressively more precise data.&lt;/p&gt;

&lt;p&gt;The outcome tagging layer is the one I'm most excited about. It's five extra lines of code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;completed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;value_usd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;50.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;user_accepted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Those five lines tell the system that this task completed, it was worth $50 in business value, and the user accepted the output. From that signal, you can calculate TCC, PRI, and a precise VPT. You can feed that data into your AES score. You can use it to inform CPF calculations.&lt;/p&gt;

&lt;p&gt;The developers who invest those five lines get access to intelligence that simply doesn't exist anywhere else in the market today.&lt;/p&gt;




&lt;h2&gt;What This Means for How You Build&lt;/h2&gt;

&lt;p&gt;If you're building an AI-powered product, the way you've been measuring it is incomplete. You know what you're spending. You don't know what you're getting.&lt;/p&gt;

&lt;p&gt;That gap has consequences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're likely optimising for token reduction when you should be optimising for task completion&lt;/li&gt;
&lt;li&gt;You're likely pricing based on competitor benchmarks when you should be pricing based on CPF&lt;/li&gt;
&lt;li&gt;You're likely discovering margin problems after the invoice arrives when you could be catching them two weeks earlier&lt;/li&gt;
&lt;li&gt;You're likely evaluating prompt changes by output quality and cost separately instead of by return&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is your fault. The tools you have access to don't measure these things. You can't manage what you can't see.&lt;/p&gt;

&lt;p&gt;Token Intelligence is the framework for making these things visible.&lt;/p&gt;




&lt;h2&gt;Where This Goes Next&lt;/h2&gt;

&lt;p&gt;I'm building Tokflo, a platform that implements these six metrics: a Token Intelligence layer that sits on top of your existing LLM calls, requires two lines of code to integrate, and shows you VPT, TCC, MAR, AES, PRI, and CPF in a real-time dashboard.&lt;/p&gt;

&lt;p&gt;No proxy. No traffic rerouting. No compliance risk. Your LLM calls go directly to the provider. Tokflo wraps the client and ships metadata asynchronously.&lt;/p&gt;

&lt;p&gt;I'll also be publishing a free LLM pricing table, updated within 48 hours of any provider change, which feeds the CPF calculation. It goes live soon.&lt;/p&gt;

&lt;p&gt;If any of these metrics would change how you run your AI product, I'd genuinely like to hear about it. Reply below, or reach out directly.&lt;/p&gt;

&lt;p&gt;The industry has been measuring the wrong thing. It's time to change that.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The six Token Intelligence metrics described in this article — VPT, TCC, MAR, AES, PRI, and CPF — are original framework definitions. If you reference them, a link back is appreciated.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
