
David Rivera


OSD600: Lab 6 — Repomix's token count tree feature and its prototyping in Rust

Summary

I analyzed how Repomix implements its token-count-based repository insights—especially the Token Count Tree—and prototyped an analogous capability in my Rust CLI tool. My prototype adds two insights to the summary section:

  • Language breakdown by file extension (files, lines, size, % of total lines)
  • Top files by line count (a quick “hotspot” view)

While Repomix focuses on token counts (via OpenAI’s tiktoken encodings), I started with line counts to deliver a fast, useful proof of concept with zero external dependencies. Below are my reading notes, code references, and design takeaways.

What is the Feature?

Repomix exposes “Token Count Optimization” and a “token count tree” visualization:

  • Docs: README → Token Count Optimization
  • CLI options include --token-count-tree and summary knobs like --top-files-len
  • Output integrates token metrics into summaries and trees, enabling threshold filtering and hotspot discovery

Where It Lives in Repomix

Repomix is a TypeScript project with a modular architecture. Token counting and metrics are implemented across a few core modules under src/core/metrics, src/core/tokenCount, and src/core/output.

How It Works (High-Level)

  • A TokenCounter abstracts tokenization and can be configured for different encodings (e.g., o200k_base for GPT-4o).
  • Repomix builds a structured view of token counts per file and per directory (buildTokenCountStructure.ts), enabling tree visualization and threshold filtering.
  • Metrics are computed centrally (calculateMetrics.ts), then consumed by output generators to render in XML/Markdown/JSON.
  • Configuration flags control whether to include summaries, how many top files to display, and whether to include a token count tree.
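
The per-file and per-directory aggregation described above can be sketched in Rust. This is a minimal illustration of the idea, not a port of Repomix's buildTokenCountStructure.ts: the names TreeNode, build_count_tree, and render_tree are hypothetical, and the counts are passed in as plain numbers so the same structure works for tokens or lines.

```rust
use std::collections::BTreeMap;

/// One node of the tree: a directory (has children) or a file (leaf).
/// Hypothetical type for illustration only.
#[derive(Debug, Default)]
pub struct TreeNode {
    /// Count for this node, including everything beneath it.
    pub total: usize,
    /// Children keyed by path component; BTreeMap keeps the order stable.
    pub children: BTreeMap<String, TreeNode>,
}

/// Aggregate (relative path, count) pairs into a tree so that every
/// directory carries the sum of the files below it.
pub fn build_count_tree(files: &[(&str, usize)]) -> TreeNode {
    let mut root = TreeNode::default();
    for (path, count) in files {
        root.total += count;
        let mut node = &mut root;
        for part in path.split('/').filter(|p| !p.is_empty()) {
            node = node.children.entry(part.to_string()).or_default();
            node.total += count;
        }
    }
    root
}

/// Render the tree as indented lines, skipping any node whose total is
/// below `threshold` (which also hides everything beneath that node).
pub fn render_tree(node: &TreeNode, threshold: usize, depth: usize, out: &mut Vec<String>) {
    for (name, child) in &node.children {
        if child.total < threshold {
            continue;
        }
        let suffix = if child.children.is_empty() { "" } else { "/" };
        out.push(format!("{}{}{} ({})", "  ".repeat(depth), name, suffix, child.total));
        render_tree(child, threshold, depth + 1, out);
    }
}
```

Because the totals are computed before anything is rendered, thresholding is a simple comparison per node, and sorting children by count instead of by name would only change the iteration order in render_tree.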

What I Learned Reading the Code

  • Separation of Concerns: Token counting is isolated from output generation via metrics modules. This keeps formatting logic simple and makes metrics reusable.
  • Pluggable Encodings: The factory for TokenCounter makes it easy to switch models/encodings.
  • Tree Building: The token count tree is computed from a structured aggregation (not rendered on the fly), which simplifies sorting and thresholding.
  • Config-Driven Output: The same metrics flow to multiple output styles with minimal branching.
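
To make the first two takeaways concrete, here is a small Rust sketch of a pluggable counter kept separate from any output code. Everything in it is an assumption for illustration: the Counter trait, the counter_for factory, and the metric names are hypothetical, and WordCounter is only a crude whitespace-based stand-in, not a real BPE tokenizer like the encodings Repomix uses.

```rust
/// Anything that can turn file contents into a count.
/// Hypothetical trait; Repomix's TokenCounter is a TypeScript class.
pub trait Counter {
    fn name(&self) -> &'static str;
    fn count(&self, text: &str) -> usize;
}

/// Counts lines, which is what the prototype in this post measures.
pub struct LineCounter;

impl Counter for LineCounter {
    fn name(&self) -> &'static str {
        "lines"
    }
    fn count(&self, text: &str) -> usize {
        text.lines().count()
    }
}

/// Crude stand-in for a tokenizer: whitespace-separated words.
/// This is NOT how BPE encodings such as o200k_base split text.
pub struct WordCounter;

impl Counter for WordCounter {
    fn name(&self) -> &'static str {
        "words"
    }
    fn count(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}

/// Hypothetical factory: pick a counter by name, the way a config value
/// might select an encoding.
pub fn counter_for(metric: &str) -> Option<Box<dyn Counter>> {
    match metric {
        "lines" => Some(Box::new(LineCounter)),
        "words" => Some(Box::new(WordCounter)),
        _ => None,
    }
}

/// Metrics code depends only on the trait, so output code never needs
/// to know which counter produced the numbers.
pub fn total_count(counter: &dyn Counter, contents: &[&str]) -> usize {
    contents.iter().map(|text| counter.count(text)).sum()
}
```

Swapping in a real tokenizer later would mean adding one more implementation of the trait and one more arm in the factory, while total_count and any rendering code stay unchanged.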

Strategies I Used to Read the Code

  • GitHub UI browsing of src/core/metrics, src/core/tokenCount, and src/core/output
  • Skimmed README to map CLI flags to modules
  • Targeted searches (file names and keywords like TokenCounter, tokenCountTree) to find implementations
  • Cross-referenced type names between modules to follow data flow

Prototype in My Rust Tool

I implemented a fast proof-of-concept using line counts instead of tokens:

  • File: src/output.rs
  • Functionality added in the Summary section:
    • Language breakdown (by file extension): files, lines, bytes, and % of total lines
    • Top files by line count (top 10)
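
The two summary insights can be sketched roughly as follows. This is a simplified reconstruction of the idea rather than the actual contents of src/output.rs: the FileContext struct here is a hypothetical stand-in that only carries a path plus the lines and size fields mentioned in this post, and the function names are made up for the example.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Hypothetical stand-in for the tool's FileContext; the real struct
/// may have more fields or different types.
pub struct FileContext {
    pub path: String,
    pub lines: usize,
    pub size: u64, // bytes
}

/// Per-extension totals for the language breakdown.
#[derive(Debug, Default, PartialEq)]
pub struct LangStats {
    pub files: usize,
    pub lines: usize,
    pub bytes: u64,
}

/// Group files by lowercase extension; files without one go under "(none)".
pub fn language_breakdown(files: &[FileContext]) -> BTreeMap<String, LangStats> {
    let mut by_ext: BTreeMap<String, LangStats> = BTreeMap::new();
    for file in files {
        let ext = Path::new(&file.path)
            .extension()
            .and_then(|e| e.to_str())
            .unwrap_or("(none)")
            .to_ascii_lowercase();
        let stats = by_ext.entry(ext).or_default();
        stats.files += 1;
        stats.lines += file.lines;
        stats.bytes += file.size;
    }
    by_ext
}

/// Share of total lines as a percentage; 0.0 when there are no lines at all.
pub fn percent_of_lines(lines: usize, total_lines: usize) -> f64 {
    if total_lines == 0 {
        0.0
    } else {
        lines as f64 * 100.0 / total_lines as f64
    }
}

/// The `limit` largest files by line count; ties are broken by path so
/// the output is deterministic.
pub fn top_files_by_lines(files: &[FileContext], limit: usize) -> Vec<&FileContext> {
    let mut sorted: Vec<&FileContext> = files.iter().collect();
    sorted.sort_by(|a, b| b.lines.cmp(&a.lines).then_with(|| a.path.cmp(&b.path)));
    sorted.truncate(limit);
    sorted
}
```

Calling top_files_by_lines(&files, 10) gives the "top 10" view, and the tie-breaking on path is there mainly so that tests can compare the summary output exactly.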

Rationale:

  • Zero new deps; leverages existing FileContext (with lines and size)
  • Mirrors Repomix’s “top files” and “token-tree” spirit with a simpler metric
  • Stable prototype surface for future evolution to real token counting

Next Steps (Planned Enhancements)

  • Add CLI/config knobs to toggle these sections and set list lengths (e.g., --top-files-len)
  • Improve language detection (map extensions to canonical names; optionally integrate linguist-like detection)
  • Introduce token counting via a Rust tokenizer (tiktoken-rs or tokenizers), behind a feature flag
  • Add a “Line Count Tree” with optional threshold to mirror Repomix’s token-count-tree UX, then later swap lines→tokens
  • Expand tests to cover new summary content deterministically
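
As one example of these enhancements, mapping extensions to canonical language names could start as a plain lookup. The sketch below is only a starting point under my own assumptions: canonical_language is a hypothetical function name, and the table is deliberately tiny compared to what a linguist-like detector would cover.

```rust
/// Map a file extension (without the dot) to a display name.
/// Hypothetical helper; unknown extensions fall back to "Other".
pub fn canonical_language(extension: &str) -> &'static str {
    match extension.to_ascii_lowercase().as_str() {
        "rs" => "Rust",
        "ts" | "tsx" => "TypeScript",
        "js" | "jsx" | "mjs" | "cjs" => "JavaScript",
        "py" => "Python",
        "md" | "markdown" => "Markdown",
        "toml" => "TOML",
        "yml" | "yaml" => "YAML",
        "json" => "JSON",
        _ => "Other",
    }
}
```

Grouping by the returned name instead of the raw extension would merge rows such as ts and tsx in the language breakdown.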

Most of these items are already tracked in their respective issues.

Open Questions

  • Which tokenizer/encoding should we support first (o200k_base vs cl100k_base)?
  • How should we handle binary and generated files in metrics? (Repomix excludes large/binary files by default.)
  • Where should metrics surface in non-Markdown outputs (JSON, plain text)?
