Inspired by Repomix's Token Count Optimization feature, which I had explored in my previous blog, I decided to add a similar feature to my own project, repo-contextr. The idea was to help developers quickly see how many tokens their repository would consume when passed to a large language model, which makes it easier to plan around context limits and estimate API costs.
Before starting the development, I created a feature request issue: Issue #18. The goal was to use the Tiktoken library by OpenAI for accurate token counting.
About Tiktoken
Tiktoken is OpenAI’s fast tokenizer that can count tokens exactly as OpenAI models like GPT-3.5 and GPT-4 do. It’s widely used by tools like LangChain and LlamaIndex to calculate how much text fits into a model’s context window. Instead of guessing based on character length, it uses the same algorithm as real LLMs, giving developers a more accurate way to measure cost and context.
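To make that concrete, here is a minimal, illustrative example of counting tokens with Tiktoken. This is not code from repo-contextr (the library is not integrated yet); it just shows the documented API in action:

```python
# Illustrative only: counting tokens the way a specific OpenAI model would.
import tiktoken

# Look up the tokenizer used by GPT-4 (cl100k_base under the hood).
encoding = tiktoken.encoding_for_model("gpt-4")

text = "Tokenize this sentence the same way GPT-4 would."
token_count = len(encoding.encode(text))
print(f"{token_count} tokens")
```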
For the first version, I decided to start simple. Instead of integrating Tiktoken right away, I used an easier method that assumes one token for every four characters. This made it possible to test the idea quickly and get early feedback without adding a heavy dependency. Later, I planned to replace this logic with the real Tiktoken library in the next iteration, tracked under Issue #19.
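The rule of thumb itself is a one-liner. Here is a minimal sketch of the idea, simplified rather than the exact code in repo-contextr:

```python
def estimate_tokens(text: str) -> int:
    """Estimate token count with the rough "one token per ~4 characters" rule."""
    if not text:
        return 0
    # Round up so small files still register as at least one token.
    return (len(text) + 3) // 4
```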
Implementation Details
To build this feature, I started by creating a new branch just for this work. The idea was to keep my main branch stable and make development easier to manage. I wrote new modules for three main tasks — token counting, formatting, and CLI integration. The token_counter.py module handles the token count logic. It scans each file, skips binary files, and counts tokens using the four-character approximation. The results are also combined at the folder level to show the total token count per directory.
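A simplified sketch of that flow follows; the helper names and the binary-file heuristic are my own choices for illustration, not repo-contextr's actual internals:

```python
from pathlib import Path

def is_binary(path: Path) -> bool:
    """Heuristic: treat files with a NUL byte in the first 1 KiB as binary."""
    try:
        with path.open("rb") as fh:
            return b"\x00" in fh.read(1024)
    except OSError:
        return True  # unreadable files are skipped like binaries

def count_repo_tokens(root: Path) -> tuple[dict[str, int], dict[str, int]]:
    """Return (per-file estimates, per-directory totals) for a repository."""
    file_tokens: dict[str, int] = {}
    dir_totals: dict[str, int] = {}
    for path in root.rglob("*"):
        if not path.is_file() or is_binary(path):
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        tokens = (len(text) + 3) // 4  # the ~4 characters per token approximation
        rel = path.relative_to(root)
        file_tokens[str(rel)] = tokens
        # Roll the count up into every ancestor directory ("." is the repo root).
        for parent in rel.parents:
            dir_totals[str(parent)] = dir_totals.get(str(parent), 0) + tokens
    return file_tokens, dir_totals
```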
The token_tree_formatter.py module formats the results into a simple tree structure. It uses characters like ├── and └── to show folders and files clearly. This layout looks consistent with repo-contextr’s current output and helps developers easily see which parts of the repository take up more tokens. Files are sorted by size, and directory totals make it easier to find large sections quickly.
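To illustrate the shape of that output, here is a small, self-contained sketch of such a formatter; the function name, data structure, and numbers are placeholders, not the ones in token_tree_formatter.py:

```python
def render_tree(node: dict, prefix: str = "") -> list[str]:
    """Render {name: (tokens, children)} as tree lines, largest entries first."""
    lines: list[str] = []
    items = sorted(node.items(), key=lambda kv: kv[1][0], reverse=True)
    for i, (name, (tokens, children)) in enumerate(items):
        last = i == len(items) - 1
        connector = "└── " if last else "├── "
        lines.append(f"{prefix}{connector}{name} ({tokens:,} tokens)")
        if children:
            lines.extend(render_tree(children, prefix + ("    " if last else "│   ")))
    return lines

# Example with made-up numbers:
tree = {
    "src": (1200, {"cli.py": (700, {}), "token_counter.py": (500, {})}),
    "README.md": (300, {}),
}
print("\n".join(render_tree(tree)))
# ├── src (1,200 tokens)
# │   ├── cli.py (700 tokens)
# │   └── token_counter.py (500 tokens)
# └── README.md (300 tokens)
```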
I also added new CLI options for users. These include --token-count-tree to show the full token tree, --token-threshold N to filter out smaller files, and --tokens to show only the total token estimate. The feature integrates with the existing CLI commands, so users see token data directly in the usual output. Along with this, I updated cli.py, package.py, and report_formatter.py to carry the token-related data through, and tested everything to make sure it worked smoothly with the rest of the app.
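The post does not show how these flags are wired up internally, but a minimal argparse sketch of the three options described above might look like this (the positional path argument, defaults, and help text are my assumptions; only the flag names come from the feature itself):

```python
import argparse

parser = argparse.ArgumentParser(prog="repo-contextr")
parser.add_argument("path", nargs="?", default=".", help="repository to analyze")
parser.add_argument("--tokens", action="store_true",
                    help="print only the total token estimate")
parser.add_argument("--token-count-tree", action="store_true",
                    help="print the per-file and per-directory token tree")
parser.add_argument("--token-threshold", type=int, metavar="N", default=0,
                    help="hide files with fewer than N estimated tokens")

# Example invocation parsed programmatically for demonstration.
args = parser.parse_args(["--token-count-tree", "--token-threshold", "100"])
print(args.token_count_tree, args.token_threshold)  # True 100
```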
Conclusion
This new feature adds more value to repo-contextr by letting developers estimate how big their repository is in terms of tokens. It helps identify which files or folders are the most token-heavy, making it easier to plan for LLM context limits and costs.
Even though this first version uses a simple character-based estimate, it sets a strong foundation for future improvements. The next step will be to integrate OpenAI’s Tiktoken library for accurate token counts. This project also reminded me how useful it is to keep work organized — using feature branches, writing clean and simple code, maintaining documentation, and keeping the Git history neat by squashing commits.