Counting tokens is dumb. So we built a free metric for AI proficiency.

Charlie Graham — Wed, 20 May 2026 21:59:12 +0000

We’ve been trying to figure out a real answer to a question that keeps coming up: how do you measure whether someone is actually good at Claude Code, Codex, and the other AI coding tools? Not “do they use them,” but “how good are they at using them?”

The first metric we looked at, like everyone else, was token usage. It’s the only number you can pull out of the box. Anthropic and OpenAI hand you spend data in the console. Spend correlates with cost. Cost is something finance asks about. So token usage becomes an easy first answer.

But obviously counting tokens sucks as a metric.

What we noticed when we looked at the actual sessions

When we started reading session logs from people who were clearly good with these tools and from people who were clearly struggling, we found that both groups burned tokens. Sometimes the strugglers burned more.

A senior developer with a refined workflow can ship in 100,000 tokens what takes a junior a million. The high-skill move is fewer turns, sharper prompts, smaller context windows, and more planning up front.

Rank by token spend and you end up rewarding the things that make people slower:

Padding context with files that aren’t relevant
Brute-forcing with longer and longer prompts
Staying in chat mode forever instead of building reusable workflows

On top of all this, once a company starts measuring token usage, the waste goes from accidental to deliberate. If your performance review or “AI adoption KPI” depends on token spend, the rational move is to burn tokens on purpose. We’ve already heard about people writing scripts that loop the model on busywork just to pump their number. The metric becomes the work, and the work stops mattering.

We’ve seen the same critique made about lines of code and commit counts. Volume isn’t skill. It just looks like it on a dashboard.

So we tried looking at something else

We started watching for things you can read from local session activity that show how someone configured and used the tool, not how much they spent on it.

Eight things kept clustering together. People with two of them usually had four. People with five usually had close to all of them.

Customization — CLAUDE.md, AGENTS.md, custom slash commands, hooks. How much did they shape the tool to their workflow, vs run defaults?
Parallel Agents — Are they using multiple agents working at once, or one chat at a time?
Background Work — Tasks delegated to run unattended, or babysitting every turn?
Tool Breadth — To what degree do they have MCP servers, skills, plugins wired into the environment?
Planning — Plan mode, structured /spec / /plan workflows, or jumping straight to file edits?
Repetition — Skill breadth and skill depth, measured separately. A lot of people install skills they never actually use.
Custom Skills — Written their own reusable workflows for things they do more than once?
Multi-Tasking — AI treated as a team running in parallel, or as a single chat window?

Plus a few tool-specific ones we added later for Codex and Cowork.

Each of these is observable from session activity. No self-report, no interview answer to game.
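
To make “observable” concrete, here is a rough sketch of what checking for the Customization signals could look like. It is not how AIQ Rank works: it inspects a project directory rather than session logs, and the layout it assumes (custom slash commands as Markdown files under .claude/commands, hooks under a "hooks" key in .claude/settings.json) is an assumption about a typical Claude Code setup. The function name customization_signals is made up for this example.

```python
import json
import tempfile
from pathlib import Path


def customization_signals(project_dir: Path) -> dict[str, bool]:
    """Report which customization artifacts exist in a project directory.

    Assumed layout (not taken from the post): slash commands live in
    .claude/commands/*.md and hooks sit under "hooks" in .claude/settings.json.
    """
    claude_dir = project_dir / ".claude"
    commands_dir = claude_dir / "commands"
    settings_file = claude_dir / "settings.json"

    has_hooks = False
    if settings_file.is_file():
        try:
            settings = json.loads(settings_file.read_text(encoding="utf-8"))
            has_hooks = isinstance(settings, dict) and bool(settings.get("hooks"))
        except json.JSONDecodeError:
            has_hooks = False  # unreadable settings count as "no hooks"

    return {
        "claude_md": (project_dir / "CLAUDE.md").is_file(),
        "agents_md": (project_dir / "AGENTS.md").is_file(),
        "slash_commands": commands_dir.is_dir() and any(commands_dir.glob("*.md")),
        "hooks": has_hooks,
    }


# Demo on a throwaway directory that contains only a CLAUDE.md file.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "CLAUDE.md").write_text("# Project notes\n", encoding="utf-8")
    print(customization_signals(project))
```

A presence check like this only says the file exists. Whether someone actually relies on it is a separate question, which is the same breadth-versus-depth split the Repetition dimension makes for skills.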

So we built a free metric

We turned the framework into AIQ Rank. AIQ Rank reads local session activity from whatever AI coding tools you’re using (Claude Code, Codex, Cursor, OpenCode, Cowork) and scores you from 0 to 1000 across eleven dimensions: the eight above plus the tool-specific ones. Think of it as a credit score, but for AI fluency.

We made it free. It runs locally — transcripts never leave your machine. You get a number, a per-dimension breakdown, and a profile URL to share if you want to.

The score is the hook. The breakdown is the part that’s actually useful. When we ran it on ourselves the first time, the result that surprised us wasn’t one of the strengths we expected to see. It was a weakness we hadn’t noticed.

What we’d suggest doing with it

If you ran a quick sanity check on your team — top 10% by token spend, top 10% by AIQ Rank — we suspect the overlap would be smaller than you’d expect. Some of the token-heavy people are still brute-forcing every problem in chat mode. Some of the lower-spend people quietly built skills, wired up MCPs, learned plan mode, and run parallel agents.

That gap is the interesting part. Not because token spend is bad data, but because it answers a different question than “who’s good at this.”
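
If you want to run that check yourself, it fits in a few lines. This is a minimal sketch, assuming you already have each person’s token spend and AIQ Rank score in two dictionaries. The helper names and the sample team are made up for illustration, and the sample uses the top 20% so that a ten-person team yields two people per list.

```python
def top_fraction(scores: dict[str, float], fraction: float = 0.10) -> set[str]:
    """Return the people in the top `fraction` of `scores` (at least one person)."""
    # Round to the nearest whole person, but never fewer than one.
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def top_overlap(
    token_spend: dict[str, float],
    aiq_scores: dict[str, float],
    fraction: float = 0.10,
) -> set[str]:
    """People who land in the top `fraction` on both measures."""
    return top_fraction(token_spend, fraction) & top_fraction(aiq_scores, fraction)


# Hypothetical team data: monthly tokens and AIQ Rank scores (0 to 1000).
token_spend = {
    "ana": 9_500_000, "ben": 8_200_000, "chen": 1_100_000, "dee": 900_000,
    "eli": 700_000, "fay": 2_000_000, "gus": 3_100_000, "hal": 400_000,
    "ivy": 1_600_000, "jon": 2_400_000,
}
aiq_scores = {
    "ana": 810, "ben": 340, "chen": 870, "dee": 620, "eli": 450,
    "fay": 510, "gus": 390, "hal": 280, "ivy": 700, "jon": 560,
}

both = top_overlap(token_spend, aiq_scores, fraction=0.20)
print(sorted(both))  # -> ['ana']
```

In this invented data, only one of the two biggest spenders is also in the top two by score. On a real team, the people who appear on one list but not the other are the ones worth looking at.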

If you want to do this comparison across your own team, AIQ Rank has private team leaderboards — invite people, scores aggregate to a board only your group sees. Transcripts still never leave anyone’s machine.

What do you think?

Please try it out and let us know what we should change or improve. Do you agree or disagree with these dimensions?