tokencount is a new Rust CLI that helps you answer a deceptively simple question: how many GPT tokens are hidden across this project? If you build AI features, write prompt-heavy docs, or just keep an eye on context windows, this tool makes the audit painless.
## Why I built it
Most token counters either run one file at a time or ignore the filesystem realities of big projects. I wanted something that:
- Walks a codebase quickly (parallel rayon workers + OS-native ignore rules)
- Respects `.gitignore` by default and lets me layer custom `--exclude` globs
- Talks the same language as OpenAI models (`cl100k_base`, `o200k_base`, etc.)
- Gives a useful summary out of the box: per-file counts, totals, percentiles, and top-N offenders
- Plays nicely with automation (JSON and NDJSON streaming modes)
## Features at a glance
- **Blazing fast scan** – `ignore::WalkBuilder` + Rayon for concurrent IO/tokenization (sketched below)
- **Smart defaults** – only scans `*.elm` unless you add `--include-ext` flags (good for Elm-heavy repos)
- **Flexible filtering** – combine `--include-ext`, `--exclude`, `--max-bytes`, and `--follow-symlinks`
- **Multiple outputs** – table, JSON array with summary, or NDJSON stream for pipelines
- **Rich stats** – totals, average per file, and P50/P90/P99 percentiles to spot outliers fast
- **Quiet/verbose modes** – keep CI logs clean or turn on detailed warnings locally
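Here's a rough sketch of how that walk-and-tokenize pipeline fits together with `ignore`, Rayon, and `tiktoken-rs`. It's an illustration of the approach under simplified assumptions (hardcoded `.elm` filter, no flag handling), not tokencount's actual source:

```rust
use ignore::WalkBuilder;
use rayon::prelude::*;
use std::fs;

fn main() -> anyhow::Result<()> {
    // Gather candidate files; WalkBuilder honors .gitignore and friends.
    let files: Vec<_> = WalkBuilder::new(".")
        .build()
        .filter_map(Result::ok)
        .filter(|e| e.file_type().is_some_and(|t| t.is_file()))
        .filter(|e| e.path().extension().is_some_and(|x| x == "elm"))
        .map(|e| e.into_path())
        .collect();

    let bpe = tiktoken_rs::cl100k_base()?;

    // Tokenize in parallel; files that aren't valid UTF-8 are skipped.
    let counts: Vec<(std::path::PathBuf, usize)> = files
        .par_iter()
        .filter_map(|path| {
            let text = fs::read_to_string(path).ok()?;
            Some((path.clone(), bpe.encode_with_special_tokens(&text).len()))
        })
        .collect();

    let total: usize = counts.iter().map(|(_, n)| n).sum();
    println!("total tokens: {total}");
    Ok(())
}
```

The sketch assumes `ignore`, `rayon`, `tiktoken-rs`, and `anyhow` as dependencies.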
## Install

```bash
cargo install tokencount
```
## Quick tour

```bash
# default: scan current directory, only *.elm files, table output
tokencount

# include Elm + TypeScript
tokencount ./frontend --include-ext elm --include-ext ts

# show top 10 largest files by tokens
tokencount --top 10

# machine-readable summary for CI
tokencount --format json > tokens.json

# streaming counts for further processing
tokencount --format ndjson

# sort descending by token count
tokencount --sort tokens
```
Each run ends with a footer like this:
```text
---
total files: 42
total tokens: 128730
average/file: 3065.00
p50: 812
p90: 7194
p99: 24403
```
Need only the top offenders? Combine `--top N` with either `--sort tokens` or the default path sort.
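Since NDJSON emits one JSON object per line, downstream tooling stays trivial. A minimal Rust consumer might look like the following; note that the `path` and `tokens` field names are my assumption for illustration, so check the actual output schema first:

```rust
use std::io::{self, BufRead};

use serde::Deserialize;

// Hypothetical per-file record; the actual field names emitted by
// `tokencount --format ndjson` may differ.
#[derive(Deserialize)]
struct FileCount {
    path: String,
    tokens: u64,
}

fn main() -> io::Result<()> {
    let mut total = 0u64;
    for line in io::stdin().lock().lines() {
        let line = line?;
        // Each NDJSON line is one self-contained JSON object.
        match serde_json::from_str::<FileCount>(&line) {
            Ok(rec) => {
                total += rec.tokens;
                println!("{:>8}  {}", rec.tokens, rec.path);
            }
            Err(_) => eprintln!("skipping unparseable line"),
        }
    }
    println!("total: {total}");
    Ok(())
}
```

Pipe `tokencount --format ndjson` into it; the example assumes `serde` (with the `derive` feature) and `serde_json` as dependencies.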
## Under the hood

- **Ignore handling** uses the `ignore` crate, with `.gitignore`, `.git/info/exclude`, and global git ignores respected automatically. I add common junk folders (`node_modules`, `target`, `.git`) so you don't have to.
- **Tokenization** relies on `tiktoken-rs`, so you get the same counts as OpenAI's `cl100k_base`/`o200k_base` encodings.
- **Error handling** is friendly by default: non-UTF-8 files and oversized blobs are skipped with warnings (or silently with `--quiet`).
- **Percentiles** use a nearest-rank approach and degrade gracefully when there are zero files (see the sketch below).
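The nearest-rank method is simple enough to show inline: sort the per-file counts and take the value at rank ceil(p/100 * n). Here's a minimal sketch of that logic (my illustration, not necessarily tokencount's exact code):

```rust
/// Nearest-rank percentile: sort the counts, then take the value at
/// 1-based rank ceil(p/100 * n). Returns None for an empty slice, which
/// is how the stats footer degrades gracefully when no files matched.
fn percentile(counts: &[u64], p: f64) -> Option<u64> {
    if counts.is_empty() {
        return None;
    }
    let mut sorted = counts.to_vec();
    sorted.sort_unstable();
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    // Clamp to valid ranks, then shift to a 0-based index.
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

fn main() {
    let counts = [812, 120, 7194, 24403, 50];
    println!("p50 = {:?}", percentile(&counts, 50.0)); // Some(812)
    println!("p90 = {:?}", percentile(&counts, 90.0)); // Some(24403)
}
```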
## Roadmap & feedback
I’m exploring:
- More encodings (if you need a different tokenizer, open an issue)
- Optional HTML/Markdown report outputs
- Built-in file size histogram to complement token stats
Repo & issues live here: github.com/CharlonTank/tokencount
If you try `tokencount`, I'd love to hear how it fits into your prompt engineering workflow or CI pipelines; reach out in the repo or drop a comment below.