Weston G

I built a client-side LLM token counter because I kept guessing at prompt costs



I was building a RAG pipeline last month. Standard stuff — system prompt, some retrieved chunks, user message. Somewhere around the third iteration of tweaking the system prompt I realized I had absolutely no idea what I was spending per request.

The system prompt alone was a wall of text. Around 500 words. That's fine in isolation, but this thing was getting prepended to every single request, and I had no intuition for what that translates to in dollars.

So I opened the OpenAI Tokenizer, pasted the text, got a number. Then I had to open a calculator to multiply tokens × price. Then I wanted to compare to Claude's pricing. Then Gemini. By request five I was annoyed enough to just build the thing.

The concrete example that made it real

Here's a system prompt I actually had sitting in my codebase:

```
You are a helpful assistant for a B2B SaaS product. You help users understand
their billing, navigate features, and troubleshoot common issues. Always be
concise. When you don't know something, say so rather than guessing...
[+ ~450 more words of context, examples, and edge case instructions]
```

~500 words. Let's call it 650 tokens (words aren't tokens — more on that below).

At current pricing, that single system prompt costs:

| Model | Input price | Cost per request | Cost per 10k requests |
| --- | --- | --- | --- |
| GPT-4o | $2.50/1M | $0.001625 | $16.25 |
| GPT-3.5 Turbo | $0.50/1M | $0.000325 | $3.25 |
| Claude 3 Haiku | $0.25/1M | ≈ $0.000163 | ≈ $1.63 |
| Gemini 1.5 Flash | $0.075/1M | ≈ $0.0000488 | ≈ $0.49 |

That's a 33x cost difference between GPT-4o and Gemini 1.5 Flash for the exact same text. At 10k daily active users doing a few requests each (call it a million requests a month), you're looking at the difference between a $50/month line item and a $1,600/month one.

And that's just the system prompt. Add the user message, the conversation history, the retrieved chunks... the numbers compound fast.
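
If you want to sanity-check the table, the arithmetic is tiny. A minimal sketch (the model keys are just labels for this example; the prices are the per-1M input rates quoted above):

```typescript
// USD per 1M input tokens, as quoted in the table above.
const PRICE_PER_1M: Record<string, number> = {
  "gpt-4o": 2.5,
  "gpt-3.5-turbo": 0.5,
  "claude-3-haiku": 0.25,
  "gemini-1.5-flash": 0.075,
};

// Input-token cost of a single request.
const costPerRequest = (tokens: number, model: string): number =>
  (tokens / 1_000_000) * PRICE_PER_1M[model];

console.log(costPerRequest(650, "gpt-4o"));                    // 0.001625
console.log(costPerRequest(650, "gpt-4o") * 10_000);           // 16.25
console.log(costPerRequest(650, "gemini-1.5-flash") * 10_000); // ~0.49
```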

What I built

LLM Token Counter — paste text, pick a model, get token count + cost estimate instantly. No account, no backend, your text never leaves the browser.

Supports:

  • GPT-4o ($2.50/1M input)
  • GPT-3.5 Turbo ($0.50/1M input)
  • Claude 3 Haiku ($0.25/1M input)
  • Gemini 1.5 Flash ($0.075/1M input)

How it works technically

The core is js-tiktoken, a pure-JavaScript port of OpenAI's tiktoken tokenizer that runs entirely in the browser. For the GPT models, this gives you the exact same token count you'd get from the API — same encodings OpenAI actually uses (o200k_base for GPT-4o, cl100k_base for GPT-3.5 Turbo), same results.

The interesting constraint: Claude and Gemini both use proprietary tokenizers that aren't publicly available as standalone libraries. Anthropic doesn't ship a JS tokenizer. Google's @google/generative-ai SDK does tokenization via API call (which defeats the whole purpose of a client-side tool).

My workaround: use cl100k_base (the GPT-4 encoding) for Claude and Gemini, and flag the result as approximate. For typical English prose, cl100k gives you counts within about 5% of what Anthropic's and Google's actual tokenizers would return. For code or non-English text, the variance is higher.

It's not perfect, but "within 5% for budgeting purposes" is way more useful than "no idea." I put a symbol next to those counts so you know they're estimates.

```typescript
import { encodingForModel } from "js-tiktoken";

// The whole counting function — it's genuinely this simple
function countTokens(text: string, model: Model): number {
  if (!text) return 0;
  const enc = encodingForModel(model.tiktokenId);
  const tokens = enc.encode(text);
  return tokens.length;
}
```

For Claude and Gemini I pass "gpt-4" as the tiktokenId, which maps to cl100k_base. It works. It's a proxy. I'm calling it out rather than hiding it.
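
For context, here's roughly what that looks like as a model config. The shape and field names are my illustration of the idea, not the app's exact code, and it assumes a js-tiktoken version recent enough to know about gpt-4o:

```typescript
import type { TiktokenModel } from "js-tiktoken";

// Hypothetical config shape; field names are illustrative.
interface Model {
  label: string;
  tiktokenId: TiktokenModel; // which tiktoken encoding to count with
  approximate: boolean;      // true when cl100k_base is only a proxy
}

const MODELS: Model[] = [
  { label: "GPT-4o", tiktokenId: "gpt-4o", approximate: false },
  { label: "GPT-3.5 Turbo", tiktokenId: "gpt-3.5-turbo", approximate: false },
  { label: "Claude 3 Haiku", tiktokenId: "gpt-4", approximate: true },   // cl100k proxy
  { label: "Gemini 1.5 Flash", tiktokenId: "gpt-4", approximate: true }, // cl100k proxy
];
```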

Why client-side matters here

I thought about building this as a tiny serverless function. It would have been five minutes of work.

But I kept coming back to the fact that the whole use case is "paste your prompts in." People's prompts contain confidential stuff — product context, internal workflows, sometimes actual user data they're debugging with. Shipping that to a server I control (even one that immediately discards it) felt wrong for a tool this casual.

Running the tokenizer in the browser means I'm not in that data path at all. Everything happens locally, the number appears instantly, nothing is logged anywhere. That also makes it fast — no network round trip, no cold starts, results update as you type.
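
The wiring really is small. A minimal sketch with plain DOM (the element IDs are made up; the real app's structure may differ):

```typescript
import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4o");
const input = document.querySelector<HTMLTextAreaElement>("#prompt")!;
const output = document.querySelector<HTMLSpanElement>("#count")!;

// Recount on every keystroke. It's all local, so there's no debounce,
// no loading state, and the text never leaves the page.
input.addEventListener("input", () => {
  output.textContent = String(enc.encode(input.value).length);
});
```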

What it doesn't do (yet)

  • Output token pricing (completions cost differently, and estimating output is harder since you don't know the length)
  • Batch cost estimator (X requests/day at Y avg tokens)
  • GPT-4o with vision or audio pricing
  • Newer models — I'll update the price table as things change

If any of those would be useful to you, the source is on GitHub and PRs are welcome.

The thing I learned building this

Words ≠ tokens in ways that bite you. "tokenization" is 3 tokens in cl100k. A JSON object with lots of punctuation tokenizes way worse than the same information in prose. Code is somewhere in between.

Rule of thumb I use now: English prose ≈ 1.3 tokens per word (the standard "1 token ≈ 0.75 words," inverted). Code ≈ 1.7–2 tokens per word, and JSON with lots of keys can run higher still. The counter makes it concrete — I've started pasting my prompts in before I finalize them.
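
A back-of-envelope version of that rule, mostly to show the earlier example lining up (the multipliers are just the rough figures above):

```typescript
// Rough token estimate from a word count. Ballpark heuristics only;
// paste into a real tokenizer when the number matters.
const TOKENS_PER_WORD = { prose: 1.3, code: 1.85 } as const;

function estimateTokens(words: number, kind: keyof typeof TOKENS_PER_WORD): number {
  return Math.round(words * TOKENS_PER_WORD[kind]);
}

console.log(estimateTokens(500, "prose")); // 650, the system prompt from earlier
```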


What's your approach to tracking LLM costs during development? Are you eyeballing it, running it through the OpenAI playground, or do you have something more systematic set up? Curious what's actually working for people.
