Summary
I analyzed how Repomix implements its token-count-based repository insights, especially the Token Count Tree, and prototyped an analogous capability in my Rust CLI tool. My prototype adds two insights to the summary section:
- Language breakdown by file extension (files, lines, MB, % of total lines)
- Top files by line count (a quick “hotspot” view)
While Repomix focuses on token counts (via OpenAI’s tiktoken encodings), I started with line counts to deliver a fast, useful proof-of-concept with zero external dependencies. Below are my reading notes, code references, and design takeaways.
What is the Feature?
Repomix exposes “Token Count Optimization” and a “token count tree” visualization:
- Docs: README → Token Count Optimization
- https://github.com/yamadashy/repomix (search for “Token Count Tree”)
- CLI options include --token-count-tree and summary knobs like --top-files-len
- Output integrates token metrics into summaries and trees, enabling threshold filtering and hotspot discovery
Where It Lives in Repomix
While Repomix is a TypeScript project, the architecture is modular. The token counting and metrics are implemented across a few core modules:
- Token counting metrics
  - src/core/metrics/TokenCounter.ts
    - https://github.com/yamadashy/repomix/blob/main/src/core/metrics/TokenCounter.ts
  - src/core/metrics/tokenCounterFactory.ts
    - https://github.com/yamadashy/repomix/blob/main/src/core/metrics/tokenCounterFactory.ts
  - src/core/metrics/calculateMetrics.ts
    - https://github.com/yamadashy/repomix/blob/main/src/core/metrics/calculateMetrics.ts
  - src/core/tokenCount/buildTokenCountStructure.ts
    - https://github.com/yamadashy/repomix/blob/main/src/core/tokenCount/buildTokenCountStructure.ts
- Output generation (where metrics flow into formatted output)
  - src/core/output/outputGenerate.ts
    - https://github.com/yamadashy/repomix/blob/main/src/core/output/outputGenerate.ts
  - src/core/output/outputSort.ts
    - https://github.com/yamadashy/repomix/blob/main/src/core/output/outputSort.ts
- Configuration
  - README “Configuration Options” → output.topFilesLength, output.tokenCountTree, tokenCount.encoding, etc.
How It Works (High-Level)
- A TokenCounter abstracts tokenization and can be configured for different encodings (e.g., o200k_base for GPT-4o).
- Repomix builds a structured view of token counts per file and per directory (buildTokenCountStructure.ts), enabling tree visualization and threshold filtering.
- Metrics are computed centrally (calculateMetrics.ts), then consumed by output generators to render in XML/Markdown/JSON.
- Configuration flags control whether to include summaries, how many top files to display, and whether to include a token count tree.
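The aggregation step described above can be sketched as follows. This is a minimal illustration in Rust (the language of my tool), not Repomix's actual TypeScript code: the names CountNode, build_count_tree, and render_tree are hypothetical, and the count can be any per-file metric (tokens in Repomix, lines in my prototype). It shows per-file counts rolled up into directory totals first, with threshold filtering and sorting applied only at render time.

```rust
use std::collections::BTreeMap;

/// One node in the count tree: a directory (with children) or a file (leaf).
/// Hypothetical type for illustration; not taken from Repomix.
#[derive(Debug, Default)]
pub struct CountNode {
    /// The file's own count, or the sum of all descendants for a directory.
    pub total: usize,
    /// Child nodes keyed by path segment; empty for files.
    pub children: BTreeMap<String, CountNode>,
}

/// Build a tree from (relative path, count) pairs, adding each file's
/// count to every ancestor directory along the way.
pub fn build_count_tree(files: &[(&str, usize)]) -> CountNode {
    let mut root = CountNode::default();
    for (path, count) in files {
        root.total += count;
        let mut node = &mut root;
        for segment in path.split('/').filter(|s| !s.is_empty()) {
            node = node.children.entry(segment.to_string()).or_default();
            node.total += count;
        }
    }
    root
}

/// Render the tree as indented lines, skipping nodes below `min_count`.
/// Children are sorted by descending total so hotspots come first.
pub fn render_tree(node: &CountNode, min_count: usize, depth: usize, out: &mut Vec<String>) {
    let mut entries: Vec<(&String, &CountNode)> = node.children.iter().collect();
    entries.sort_by(|a, b| b.1.total.cmp(&a.1.total).then_with(|| a.0.cmp(b.0)));
    for (name, child) in entries {
        if child.total < min_count {
            continue;
        }
        // Nodes with children are treated as directories.
        let suffix = if child.children.is_empty() { "" } else { "/" };
        out.push(format!("{}{}{} ({})", "  ".repeat(depth), name, suffix, child.total));
        render_tree(child, min_count, depth + 1, out);
    }
}
```

Because totals are stored on every node before rendering, the threshold check is a single comparison per node, and a directory whose total falls below the threshold is pruned together with everything beneath it.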
What I Learned Reading the Code
- Separation of Concerns: Token counting is isolated from output generation via metrics modules. This keeps formatting logic simple and makes metrics reusable.
- Pluggable Encodings: The factory for TokenCounter makes it easy to switch models/encodings.
- Tree Building: The token count tree is computed from a structured aggregation (not rendered-on-the-fly), which simplifies sorting and thresholding.
- Config-Driven Output: The same metrics flow to multiple output styles with minimal branching.
Strategies I Used to Read the Code
- GitHub UI browsing of src/core/metrics, src/core/tokenCount, and src/core/output
- Skimmed README to map CLI flags to modules
- Targeted searches (file names and keywords like TokenCounter, tokenCountTree) to find implementations
- Cross-referenced type names between modules to follow data flow
Prototype in My Rust Tool
I implemented a fast proof-of-concept using line counts instead of tokens:
- File: src/output.rs
- Functionality added in the Summary section:
  - Language breakdown (by file extension): files, lines, bytes, and % of total lines
  - Top files by lines (first 10)
Rationale:
- Zero new deps; leverages existing FileContext (with lines and size)
- Mirrors Repomix’s “top files” and “token-tree” spirit with a simpler metric
- Stable prototype surface for future evolution to real token counting
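The two summary insights can be sketched roughly as below. This is a simplified illustration rather than the exact code in src/output.rs: the FileContext struct here is a hypothetical stand-in (the post only establishes that the real one carries lines and size), and LangStats, language_breakdown, and top_files_by_lines are names chosen for the example. Only the standard library is used, in keeping with the zero-dependency goal.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Hypothetical stand-in for the tool's FileContext; the `path` field is an assumption.
#[derive(Debug, Clone)]
pub struct FileContext {
    pub path: String,
    pub lines: usize,
    pub size: u64, // bytes
}

/// Aggregated stats for one file extension.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct LangStats {
    pub files: usize,
    pub lines: usize,
    pub bytes: u64,
}

/// Group files by lowercased extension. Returns (extension, stats, % of total lines),
/// sorted by line count descending, then by extension name.
pub fn language_breakdown(files: &[FileContext]) -> Vec<(String, LangStats, f64)> {
    let mut by_ext: HashMap<String, LangStats> = HashMap::new();
    for f in files {
        let ext = Path::new(&f.path)
            .extension()
            .and_then(|e| e.to_str())
            .map(|e| e.to_ascii_lowercase())
            .unwrap_or_else(|| "(none)".to_string());
        let entry = by_ext.entry(ext).or_default();
        entry.files += 1;
        entry.lines += f.lines;
        entry.bytes += f.size;
    }
    let total_lines: usize = files.iter().map(|f| f.lines).sum();
    let mut rows: Vec<(String, LangStats, f64)> = by_ext
        .into_iter()
        .map(|(ext, stats)| {
            // Guard against division by zero for empty repositories.
            let pct = if total_lines == 0 {
                0.0
            } else {
                stats.lines as f64 * 100.0 / total_lines as f64
            };
            (ext, stats, pct)
        })
        .collect();
    rows.sort_by(|a, b| b.1.lines.cmp(&a.1.lines).then_with(|| a.0.cmp(&b.0)));
    rows
}

/// Return up to `n` files with the most lines, largest first (ties broken by path).
pub fn top_files_by_lines(files: &[FileContext], n: usize) -> Vec<&FileContext> {
    let mut sorted: Vec<&FileContext> = files.iter().collect();
    sorted.sort_by(|a, b| b.lines.cmp(&a.lines).then_with(|| a.path.cmp(&b.path)));
    sorted.truncate(n);
    sorted
}
```

Sorting with an explicit tie-breaker keeps the output stable across runs, which matters because HashMap iteration order is not deterministic and the summary should be testable.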
Next Steps (Planned Enhancements)
- Add CLI/config knobs to toggle these sections and set list lengths (e.g., --top-files-len)
- Improve language detection (map extensions to canonical names; optionally integrate linguist-like detection)
- Introduce token counting via a Rust tokenizer (tiktoken-rs or tokenizers), behind a feature flag
- Add a “Line Count Tree” with optional threshold to mirror Repomix’s token-count-tree UX, then later swap lines→tokens
- Expand tests to cover new summary content deterministically
Most of these are already tracked in their respective issues.
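One way to prepare for the lines→tokens swap is to hide the metric behind a small trait, so the summary and tree code never care which unit they are counting. The sketch below is an assumption about how this could look, not existing code: Counter, LineCounter, WhitespaceCounter, and make_counter are hypothetical names, and WhitespaceCounter is only a crude placeholder, not a real tokenizer. A tiktoken-rs or tokenizers backed implementation could later implement the same trait behind a feature flag.

```rust
/// A pluggable per-file metric (hypothetical trait for illustration).
pub trait Counter {
    /// Short label for headings, e.g. "lines" or "tokens".
    fn unit(&self) -> &'static str;
    /// Count the metric for one file's text.
    fn count(&self, text: &str) -> usize;
}

/// The current prototype metric: number of lines.
pub struct LineCounter;

impl Counter for LineCounter {
    fn unit(&self) -> &'static str {
        "lines"
    }
    fn count(&self, text: &str) -> usize {
        text.lines().count()
    }
}

/// Crude placeholder that counts whitespace-separated words.
/// This is NOT a tokenizer; it only marks where a real one would plug in.
pub struct WhitespaceCounter;

impl Counter for WhitespaceCounter {
    fn unit(&self) -> &'static str {
        "words"
    }
    fn count(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}

/// Pick a counter by name, falling back to line counting for unknown names.
pub fn make_counter(name: &str) -> Box<dyn Counter> {
    match name {
        "words" => Box::new(WhitespaceCounter),
        _ => Box::new(LineCounter),
    }
}
```

This mirrors the factory idea noted earlier in a much smaller form: callers ask for a counter by name and only depend on the trait, so adding a token-based implementation would not touch the output code.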
Open Questions
- Which tokenizer/encoding should we support first (o200k_base vs cl100k_base)?
- How should we handle binary and generated files in metrics? (Repomix excludes large and binary files by default)
- Where to surface metrics in non-Markdown outputs (JSON, plain)?
Links
- Repomix repo: https://github.com/yamadashy/repomix
- Token Counter: https://github.com/yamadashy/repomix/blob/main/src/core/metrics/TokenCounter.ts
- Token structure: https://github.com/yamadashy/repomix/blob/main/src/core/tokenCount/buildTokenCountStructure.ts
- Output generator: https://github.com/yamadashy/repomix/blob/main/src/core/output/outputGenerate.ts