David Rivera

Posted on Oct 28

OSD600: Lab 6 — Repomix's token count tree feature and its prototyping in Rust

#cli #rust #showdev #tooling

Summary

I analyzed how Repomix implements its token-count-based repository insights—especially the Token Count Tree—and prototyped an analogous capability in my Rust CLI tool. My prototype adds two insights to the summary section:

Language breakdown by file extension (files, lines, size, % of total lines)
Top files by line count (a quick “hotspot” view)

While Repomix focuses on token counts (computed with OpenAI’s tiktoken encodings), I started with line counts to deliver a fast, useful proof-of-concept with zero external dependencies. Below are my reading notes, code references, and design takeaways.

What is the Feature?

Repomix exposes “Token Count Optimization” and a “token count tree” visualization:

Docs: README → Token Count Optimization
- https://github.com/yamadashy/repomix (search for “Token Count Tree”)
CLI options include --token-count-tree and summary knobs like --top-files-len
Output integrates token metrics into summaries and trees, enabling threshold filtering and hotspot discovery

Where It Lives in Repomix

Repomix is a TypeScript project with a modular architecture. The token counting and metrics are implemented across a few core modules:

Token counting metrics
- src/core/metrics/TokenCounter.ts
- https://github.com/yamadashy/repomix/blob/main/src/core/metrics/TokenCounter.ts
- src/core/metrics/tokenCounterFactory.ts
- https://github.com/yamadashy/repomix/blob/main/src/core/metrics/tokenCounterFactory.ts
- src/core/metrics/calculateMetrics.ts
- https://github.com/yamadashy/repomix/blob/main/src/core/metrics/calculateMetrics.ts
- src/core/tokenCount/buildTokenCountStructure.ts
- https://github.com/yamadashy/repomix/blob/main/src/core/tokenCount/buildTokenCountStructure.ts
Output generation (where metrics flow into formatted output)
- src/core/output/outputGenerate.ts
- https://github.com/yamadashy/repomix/blob/main/src/core/output/outputGenerate.ts
- src/core/output/outputSort.ts
- https://github.com/yamadashy/repomix/blob/main/src/core/output/outputSort.ts
Configuration
- README “Configuration Options” → output.topFilesLength, output.tokenCountTree, tokenCount.encoding, etc.

How It Works (High-Level)

A TokenCounter abstracts tokenization and can be configured for different encodings (e.g., o200k_base for GPT-4o).
Repomix builds a structured view of token counts per file and per directory (buildTokenCountStructure.ts), enabling tree visualization and threshold filtering.
Metrics are computed centrally (calculateMetrics.ts), then consumed by output generators to render in XML/Markdown/JSON.
Configuration flags control whether to include summaries, how many top files to display, and whether to include a token count tree.
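To make the flow above concrete, here is a minimal Rust sketch of the same idea: a counter abstraction feeding one central metrics pass. It is not Repomix's code. The names `Counter`, `LineCounter`, `WordCounter`, `Metrics`, and `calculate_metrics` are hypothetical, and `WordCounter` is only a whitespace-splitting stand-in that marks where a real tokenizer would plug in.

```rust
use std::collections::BTreeMap;

/// Anything that can turn file content into a numeric weight.
pub trait Counter {
    fn count(&self, content: &str) -> usize;
}

/// Counts lines; this is the metric the prototype uses today.
pub struct LineCounter;

impl Counter for LineCounter {
    fn count(&self, content: &str) -> usize {
        content.lines().count()
    }
}

/// Stand-in for a real tokenizer: counts whitespace-separated words.
/// This is NOT how tiktoken encodings work; it only shows the plug-in point.
pub struct WordCounter;

impl Counter for WordCounter {
    fn count(&self, content: &str) -> usize {
        content.split_whitespace().count()
    }
}

/// Metrics computed once, then handed to whatever renders the output.
pub struct Metrics {
    pub total: usize,
    pub per_file: BTreeMap<String, usize>,
}

/// `files` is a list of (path, content) pairs.
pub fn calculate_metrics(files: &[(&str, &str)], counter: &dyn Counter) -> Metrics {
    let mut per_file = BTreeMap::new();
    let mut total = 0;
    for &(path, content) in files {
        let n = counter.count(content);
        total += n;
        per_file.insert(path.to_string(), n);
    }
    Metrics { total, per_file }
}
```

Because renderers would only ever see `Metrics`, swapping `LineCounter` for a tokenizer-backed counter later should not touch any formatting code.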

What I Learned Reading the Code

Separation of Concerns: Token counting is isolated from output generation via metrics modules. This keeps formatting logic simple and makes metrics reusable.
Pluggable Encodings: The factory for TokenCounter makes it easy to switch models/encodings.
Tree Building: The token count tree is computed from a structured aggregation (not rendered-on-the-fly), which simplifies sorting and thresholding.
Config-Driven Output: The same metrics flow to multiple output styles with minimal branching.
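The tree-building point is the one I most wanted to try in Rust, so here is a rough sketch of aggregating first and rendering second. It assumes forward-slash-separated relative paths and precomputed per-file counts. `Node`, `build_tree`, and `render` are names I made up for illustration, not anything from Repomix.

```rust
use std::collections::BTreeMap;

/// One directory or file in the aggregated tree.
#[derive(Default)]
pub struct Node {
    /// Count for this node, including everything beneath it.
    pub total: usize,
    pub children: BTreeMap<String, Node>,
}

/// Build the whole tree up front; every node on a file's path absorbs its count.
pub fn build_tree(files: &[(&str, usize)]) -> Node {
    let mut root = Node::default();
    for &(path, count) in files {
        root.total += count;
        let mut node = &mut root;
        for part in path.split('/') {
            node = node.children.entry(part.to_string()).or_default();
            node.total += count;
        }
    }
    root
}

/// Render after aggregation, skipping any node whose total is below `min`.
pub fn render(node: &Node, min: usize, depth: usize, out: &mut Vec<String>) {
    for (name, child) in &node.children {
        if child.total < min {
            continue;
        }
        out.push(format!("{}{} ({})", "  ".repeat(depth), name, child.total));
        render(child, min, depth + 1, out);
    }
}
```

Since directory totals exist before anything is printed, thresholding is a single comparison per node, and sorting by total instead of by name would be a local change inside `render`.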

Strategies I Used to Read the Code

GitHub UI browsing of src/core/metrics, src/core/tokenCount, and src/core/output
Skimmed README to map CLI flags to modules
Targeted searches (file names and keywords like TokenCounter, tokenCountTree) to find implementations
Cross-referenced type names between modules to follow data flow

Prototype in My Rust Tool

I implemented a fast proof-of-concept using line counts instead of tokens:

File: src/output.rs
Functionality added in the Summary section:
- Language breakdown (by file extension): files, lines, bytes, and % of total lines
- Top files by line count (top 10)
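A simplified sketch of how the two summary insights can be computed is below. It is not a copy of src/output.rs: the `FileContext` here is a hypothetical stand-in with only the fields the summary needs, and `LangStats`, `language_breakdown`, `percent_of`, and `top_files` are illustrative names.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Hypothetical stand-in for the tool's FileContext; only the fields used here.
pub struct FileContext {
    pub path: String,
    pub lines: usize,
    pub size: u64, // bytes
}

#[derive(Default, Debug, PartialEq)]
pub struct LangStats {
    pub files: usize,
    pub lines: usize,
    pub bytes: u64,
}

/// Group files by lowercased extension; files without one land under "(none)".
pub fn language_breakdown(files: &[FileContext]) -> BTreeMap<String, LangStats> {
    let mut by_ext: BTreeMap<String, LangStats> = BTreeMap::new();
    for f in files {
        let ext = Path::new(&f.path)
            .extension()
            .and_then(|e| e.to_str())
            .unwrap_or("(none)")
            .to_lowercase();
        let stats = by_ext.entry(ext).or_default();
        stats.files += 1;
        stats.lines += f.lines;
        stats.bytes += f.size;
    }
    by_ext
}

/// Share of total lines as a percentage; 0.0 when there are no lines at all.
pub fn percent_of(lines: usize, total: usize) -> f64 {
    if total == 0 {
        0.0
    } else {
        lines as f64 * 100.0 / total as f64
    }
}

/// The `n` largest files by line count; ties are broken by path for stable output.
pub fn top_files(files: &[FileContext], n: usize) -> Vec<&FileContext> {
    let mut sorted: Vec<&FileContext> = files.iter().collect();
    sorted.sort_by(|a, b| b.lines.cmp(&a.lines).then_with(|| a.path.cmp(&b.path)));
    sorted.truncate(n);
    sorted
}
```

Using a `BTreeMap` and an explicit tie-break keeps the output order deterministic, which makes the summary easier to cover with tests.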

Rationale:

Zero new deps; leverages existing FileContext (with lines and size)
Mirrors the spirit of Repomix’s “top files” list and token count tree with a simpler metric
Stable prototype surface for future evolution to real token counting

Next Steps (Planned Enhancements)

Add CLI/config knobs to toggle these sections and set list lengths (e.g., --top-files-len)
Improve language detection (map extensions to canonical names; optionally integrate linguist-like detection)
Introduce token counting via a Rust tokenizer (tiktoken-rs or tokenizers), behind a feature flag
Add a “Line Count Tree” with optional threshold to mirror Repomix’s token-count-tree UX, then later swap lines→tokens
Expand tests to cover new summary content deterministically

Most of these are already tracked in their respective issues.
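For the first item, the knobs could look something like the sketch below. Everything here is an assumption about a design that does not exist yet: `SummaryOptions`, `parse_summary_options`, and the `--no-language-breakdown` flag are hypothetical, and `--top-files-len` is borrowed from Repomix's naming. A real implementation would more likely hook into the tool's existing argument parser.

```rust
/// Hypothetical options controlling the summary section.
#[derive(Debug, PartialEq)]
pub struct SummaryOptions {
    pub top_files_len: usize,
    pub show_languages: bool,
}

impl Default for SummaryOptions {
    fn default() -> Self {
        // Matches the prototype's current behavior: top 10 files, breakdown shown.
        Self { top_files_len: 10, show_languages: true }
    }
}

/// Pick the two summary flags out of an argument list, ignoring everything else.
pub fn parse_summary_options<I>(args: I) -> Result<SummaryOptions, String>
where
    I: IntoIterator<Item = String>,
{
    let mut opts = SummaryOptions::default();
    let mut args = args.into_iter();
    while let Some(arg) = args.next() {
        match arg.as_str() {
            "--top-files-len" => {
                let value = args.next().ok_or("--top-files-len needs a value")?;
                opts.top_files_len = value
                    .parse::<usize>()
                    .map_err(|_| format!("invalid --top-files-len: {value}"))?;
            }
            "--no-language-breakdown" => opts.show_languages = false,
            _ => {} // not a summary flag; left for the rest of the CLI to handle
        }
    }
    Ok(opts)
}
```

In a binary this would be called with `std::env::args().skip(1)`, and the summary renderer would read `top_files_len` instead of a hard-coded 10.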

Open Questions

Which tokenizer/encoding should we support first (o200k_base vs cl100k_base)?
How should we handle binary and generated files in metrics? (Repomix excludes large and binary files by default.)
Where should metrics surface in non-Markdown outputs (JSON, plain)?

DEV Community