Charlie Graham

Posted on May 21

Counting tokens is dumb. So we built a free metric for AI proficiency.

#ai #programming #claude #openai

We’ve been trying to figure out a real answer to a question that keeps coming up: how do you measure whether someone is actually good at Claude Code, Codex, and the other AI coding tools? Not "do they use them," but how good are they at using AI.

The first metric we looked at, like everyone else, was token usage. It’s the only number you can pull out of the box. Anthropic and OpenAI hand you token data in the console. So token usage becomes an easy first answer.

But obviously counting tokens sucks as a metric.

What we noticed when we looked at the actual sessions

When we started reading session logs from people who were clearly good with these tools, and people who were clearly struggling, both groups burned tokens. Sometimes the strugglers burned more.
A senior developer with a refined workflow can ship in 1 million tokens what a junior burns through 10 million on. The high-skill move is fewer turns, sharper prompts, smaller context windows, more planning up front.
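To make the arithmetic concrete, here is a minimal sketch using the 1 million vs 10 million figures above. The `Dev` class, the `tasks_shipped` field, and the tokens-per-task ratio are illustrative assumptions, not part of any real dashboard, and the ratio is not the metric this post ends up proposing.

```python
from dataclasses import dataclass


@dataclass
class Dev:
    name: str
    tokens_used: int    # total tokens over the period
    tasks_shipped: int  # hypothetical count of completed tasks


def rank_by_spend(devs):
    """Leaderboard a raw token-usage dashboard would produce: most tokens first."""
    return [d.name for d in sorted(devs, key=lambda d: d.tokens_used, reverse=True)]


def tokens_per_task(dev):
    """Tokens spent per shipped task; lower means less spend for the same output."""
    return dev.tokens_used / dev.tasks_shipped


# Same shipped output, 10x difference in spend (figures from the text above).
devs = [
    Dev("senior", tokens_used=1_000_000, tasks_shipped=1),
    Dev("junior", tokens_used=10_000_000, tasks_shipped=1),
]

print(rank_by_spend(devs))  # ['junior', 'senior']
print({d.name: tokens_per_task(d) for d in devs})
```

A spend-ranked leaderboard puts the junior on top even though both shipped the same thing.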

Rank by token spend and you end up rewarding the things that make people slower:

Padded context with files that aren’t relevant
Brute-forcing with longer and longer prompts
Staying in chat mode forever instead of building reusable workflows

On top of all this, once a company starts measuring token usage, the incentive flips from accidental to deliberate. If your performance review or “AI adoption KPI” depends on token counts, the rational move is to burn tokens on purpose. We’ve already heard about people writing scripts that loop the model on busywork just to pump their number.

We’ve seen the same critique made about lines of code and commit counts. Volume isn’t skill. It just looks like it on a dashboard.

So we tried looking at something else

Instead, we started watching for things you can read from local session activity that show how someone configured the tool, not how much they spent on it.

Eight things kept clustering together. People with two of them usually had four. People with five usually had close to all of them.

Customization — CLAUDE.md, AGENTS.md, custom slash commands, hooks. How much did they shape the tool to their workflow, vs run defaults?
Parallel Agents — Are they using multiple agents working at once, or one chat at a time?
Background Work — Tasks delegated to run unattended, or babysitting every turn?
Tool Breadth — To what degree do they have MCP servers, skills, plugins wired into the environment?
Planning — Plan mode, structured /spec / /plan workflows, or jumping straight to file edits?
Repetition — Skill breadth and skill depth, measured separately. A lot of people install skills they never actually use.
Custom Skills — Written their own reusable workflows for things they do more than once?
Multi-Tasking — AI treated as a team running in parallel, or as a single chat window?

Plus a few tool-specific ones we added later for Codex and Cowork.
Each of these is observable from session activity. No self-report, no interview answer to game.
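As a rough illustration of what "observable" can mean, here is a minimal sketch that checks a project directory for a few configuration markers tied to the Customization, Tool Breadth, and Custom Skills ideas above. This is not how AIQ Rank works (it reads session activity). The marker paths are assumptions about a common Claude Code project layout, and `detect_signals` is a hypothetical helper.

```python
from pathlib import Path

# Assumed, illustrative mapping from signal name to file-system markers.
# These paths are guesses at a typical layout, not a definitive list.
SIGNAL_MARKERS = {
    "instructions_file": ["CLAUDE.md", "AGENTS.md"],
    "custom_commands": [".claude/commands"],
    "project_settings": [".claude/settings.json"],
    "mcp_servers": [".mcp.json"],
    "custom_skills": [".claude/skills"],
}


def _present(path):
    """A directory counts only if it is non-empty; a file counts if it exists."""
    if path.is_dir():
        return any(path.iterdir())
    return path.is_file()


def detect_signals(project_root):
    """Return {signal_name: bool} for each marker group found under project_root."""
    root = Path(project_root)
    return {
        signal: any(_present(root / marker) for marker in markers)
        for signal, markers in SIGNAL_MARKERS.items()
    }


if __name__ == "__main__":
    for signal, found in detect_signals(".").items():
        print(f"{signal}: {'yes' if found else 'no'}")
```

A presence check like this only says that something was configured, not whether it gets used, which is the same breadth vs depth distinction the Repetition item draws.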

So we built a free metric

We turned the framework into AIQ Rank. It reads local session activity from whatever AI coding tools you’re using (Claude Code, Codex, Cursor, OpenCode, Cowork) and scores you 0-1000 across eleven dimensions: the eight above plus the tool-specific ones. Think of it as a credit score, but for AI fluency.

We made it free. It runs locally — transcripts never leave your machine. You get a number, a per-dimension breakdown, and a profile URL to share if you want to.

The score is the hook. The breakdown is the part that’s actually useful. When we ran it on ourselves the first time, the dimension that surprised us wasn’t a strength we expected. It was a weakness we hadn’t noticed.

What we’d suggest doing with it

If you ran a quick sanity check on your team - top 10% by token spend, top 10% by AIQ Rank - we suspect the overlap would be smaller than you’d expect. Some of the token-heavy people are still brute-forcing every problem in chat mode. Some of the lower-spend people quietly built skills, wired up MCPs, learned plan mode, and run parallel agents.
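The sanity check can be sketched in a few lines. This is a minimal version that assumes you already have two dictionaries keyed by person, one with token spend and one with scores. The function names and the sample numbers are made up for illustration.

```python
import math


def top_fraction(scores, fraction=0.10):
    """Return the names in the top `fraction` of `scores` (always at least one)."""
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def top_overlap(token_spend, aiq_scores, fraction=0.10):
    """Compare the top slice by token spend with the top slice by score.

    Returns (shared_names, share_of_top_spenders_who_are_also_top_scorers).
    Only people present in both inputs are considered.
    """
    people = token_spend.keys() & aiq_scores.keys()
    if not people:
        return set(), 0.0
    spend_top = top_fraction({p: token_spend[p] for p in people}, fraction)
    score_top = top_fraction({p: aiq_scores[p] for p in people}, fraction)
    shared = spend_top & score_top
    return shared, len(shared) / len(spend_top)


# Made-up team of four; fraction=0.5 so each "top" group has two people.
token_spend = {"ana": 9_000_000, "ben": 7_500_000, "cy": 1_200_000, "dee": 900_000}
aiq_scores = {"ana": 310, "ben": 720, "cy": 880, "dee": 640}

shared, ratio = top_overlap(token_spend, aiq_scores, fraction=0.5)
print(shared, ratio)  # {'ben'} 0.5
```

On a real team you would leave `fraction` at its default of 0.10 and look at who falls into one top group but not the other.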

That gap is the interesting part. Not because token spend is bad data, but because it answers a different question than “who’s good at this.”

If you want to do this comparison across your own team, AIQ Rank has private team leaderboards. Invite people and you get a leaderboard for just your team, including the skills used across it. Transcripts still never leave anyone’s machine.

What do you think?

AIQ Rank is still a work in progress and we are constantly improving it. Do you agree or disagree with these parameters? Let us know what changes or improvements we should make, and please try it out!

Again, you can get your score for free in 60 seconds from aiqrank.com. We’d love any constructive feedback!

Top comments (3)

Ken Imoto • May 21

The "junior burns 10M tokens to ship what a senior does in 1M" observation matches exactly what I see running my own harness - cost-per-published-article dropped roughly 5x not because the model got cheaper but because I learned to stop dumping whole repos into context and started writing tight skills with concrete acceptance criteria. The risk with AIQ Rank is the usual one for proxy metrics: once it's measured, it'll get gamed (people will instrument tool calls just to score the "parallel agents" dimension). Any empirical correlation against actual shipped output planned? That's the thing that would make this much harder to dismiss.

Charlie Graham • May 22

Thanks, Ken.

Totally agree about the risk of gaming the score. It will definitely be a cat-and-mouse game. One reason submissions go to our site is so we can keep hidden heuristics that help us distinguish fake submissions from real ones.

And great question about shipped output vs score. We are still gathering empirical data and tweaking the score as we learn more. Anecdotally, the top-performing engineers (and others) so far also tend to have the top scores.

Mykola Kondratiuk • May 31

fair on proficiency metrics, but tokens are still the right unit for budget enforcement. the "how skilled is this dev" question and the "has this agent loop gone haywire" question need separate answers anyway