Introduction
In the world of software development, open-source contributions not only help improve projects but also provide invaluable learning experiences. In this blog post, I'll share my journey of contributing to the EnglishFormatter project—a command-line tool designed to format, summarize, and paraphrase text documents using advanced language models.
Watch some of the functionality here
Project Overview
EnglishFormatter is a C++ command-line application that uses large language models (LLMs) to process text files. It offers an interactive menu for users to select actions and supports customization through command-line flags.
Features:
- Format Documents: Enhance the readability and structure of text.
- Summarize Documents: Generate concise summaries.
- Paraphrase Documents: Rephrase text while retaining the original meaning.
- Customizable Models: Choose different language models via the --model flag.
- Output Customization: Define custom suffixes for output files using the --output flag.
- Token Usage Reporting: Display token usage information with the new --token-usage flag.
GitHub Repository: EnglishFormatter
The New Feature: Token Usage Reporting
Understanding token usage is crucial when working with LLMs because of context length limits and cost. I added a new command-line flag, --token-usage (or -t), that reports the number of tokens sent in the prompt and received in the response.
Why Token Usage?
- Cost Management: Helps users estimate the cost of API calls.
- Debugging: Assists in optimizing prompts to stay within token limits.
- Performance Tuning: Enables users to fine-tune inputs for better responses.
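To make the cost-management point concrete, here is a minimal sketch of turning reported token counts into a rough cost estimate. The Pricing values are placeholders, not real rates for any provider, and estimate_cost_usd is a hypothetical helper that is not part of EnglishFormatter.

```cpp
#include <cassert>

// Hypothetical per-million-token prices; substitute your provider's actual rates.
struct Pricing {
    double usd_per_million_prompt = 0.50;      // placeholder value
    double usd_per_million_completion = 1.50;  // placeholder value
};

// Rough cost of one request, given the token counts the API reported.
double estimate_cost_usd(int prompt_tokens, int completion_tokens,
                         const Pricing& pricing = Pricing{}) {
    assert(prompt_tokens >= 0 && completion_tokens >= 0);
    return prompt_tokens * pricing.usd_per_million_prompt / 1e6
         + completion_tokens * pricing.usd_per_million_completion / 1e6;
}
```

With these placeholder prices, a request with 50 prompt tokens and 200 completion tokens comes to about $0.000325.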
Getting Started with EnglishFormatter
Prerequisites
- C++17 or higher: Required for compiling the code.
- libcurl: For HTTP requests.
- nlohmann/json: For JSON parsing.
- dotenv-cpp: To manage environment variables.
- An API Key: From the language model provider (e.g., OpenAI).
Installation
- Clone the Repository
git clone https://github.com/yourusername/EnglishFormatter.git
cd EnglishFormatter
- Install Dependencies
  - libcurl
    - Windows: Use the pre-built binaries.
    - Linux:
      sudo apt-get install libcurl4-openssl-dev
  - nlohmann/json
    Download the json.hpp file from the official repository.
  - dotenv-cpp
    git clone https://github.com/motdotla/dotenv-cpp.git
    Include dotenv.h and dotenv.cpp in your project.
- Set Up Environment Variables
Create a .env file in the project root:
API_KEY=your_api_key_here
- Build the Project
g++ -std=c++17 -o EnglishFormatter main.cpp eng_format.cpp display.cpp dotenv.cpp -lcurl -pthread
Configuration
- API Key
  Ensure your API key is set in the .env file or as an environment variable.
- Default Settings
  - Model: Default is llama3-8b-8192.
  - Output Suffix: Default is _modified.
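As an illustration of the API key lookup, here is a minimal sketch that parses .env-style text and falls back to it when no API_KEY environment variable is set. It does not use dotenv-cpp. The helper names parse_env and find_api_key are hypothetical, and the precedence (environment variable first, then the .env file) is my assumption rather than something the project documents.

```cpp
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>

// Parse KEY=VALUE lines; blank lines and lines starting with '#' are skipped.
std::map<std::string, std::string> parse_env(const std::string& text) {
    std::map<std::string, std::string> vars;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF files
        if (line.empty() || line[0] == '#') continue;
        const std::string::size_type eq = line.find('=');
        if (eq == std::string::npos || eq == 0) continue;  // not a KEY=VALUE line
        vars[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return vars;
}

// Assumed precedence: a real environment variable wins over the .env file.
// Returns an empty string when no key is found in either place.
std::string find_api_key(const std::map<std::string, std::string>& file_vars) {
    const char* from_env = std::getenv("API_KEY");
    if (from_env != nullptr && *from_env != '\0') return from_env;
    const auto it = file_vars.find("API_KEY");
    return it != file_vars.end() ? it->second : std::string();
}
```

Checking for an empty result at startup lets the tool fail with a clear message before any request is sent.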
Using EnglishFormatter
Run the application from the command line:
./EnglishFormatter [options]
Command-Line Options
Options:
-h, --help Show this help message and exit
-v, --version Show the tool's version and exit
-t, --token-usage Show token usage information
-m, --model MODEL Specify the model to use
-o, --output NAME Specify the output file suffix
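Note that --output takes a suffix, not a full file name. How the suffix is combined with the input name is not spelled out above, so the sketch below shows one plausible reading: the suffix is inserted between the file stem and its extension. The helper with_suffix is hypothetical and not taken from the project.

```cpp
#include <string>

// Assumed behaviour: "sample.txt" with "_modified" becomes "sample_modified.txt".
// Files without an extension simply get the suffix appended.
std::string with_suffix(const std::string& input, const std::string& suffix) {
    const std::string::size_type slash = input.find_last_of("/\\");
    const std::string::size_type dot = input.find_last_of('.');
    const std::string::size_type name_start =
        (slash == std::string::npos) ? 0 : slash + 1;
    // Only treat the dot as an extension separator if it sits inside the file
    // name and is not its first character (so ".env" has no extension).
    if (dot == std::string::npos || dot <= name_start) return input + suffix;
    return input.substr(0, dot) + suffix + input.substr(dot);
}
```

Under this assumption, with_suffix("sample.txt", "_modified") yields sample_modified.txt.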
Interactive Menu
Without any options, the tool launches an interactive menu:
./EnglishFormatter
Menu Options:
- Format document
- Summarize document
- Paraphrase document
- Exit
Navigate using the Up and Down arrow keys and select with Enter.
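For readers curious how arrow-key navigation can be modelled, here is a small sketch of the selection logic on its own, separate from any terminal handling. The Key enum and move_selection are hypothetical names, and the wrap-around at the top and bottom of the menu is an assumption about the behaviour, not something I verified in the source.

```cpp
#include <cassert>
#include <cstddef>

enum class Key { Up, Down, Enter };

// Returns the highlighted index after a key press.
// Up and Down wrap around the ends of the menu (assumption); Enter leaves it unchanged.
std::size_t move_selection(std::size_t current, Key key, std::size_t item_count) {
    assert(item_count > 0 && current < item_count);
    switch (key) {
        case Key::Up:    return (current + item_count - 1) % item_count;
        case Key::Down:  return (current + 1) % item_count;
        case Key::Enter: return current;
    }
    return current;
}
```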
Example Usage
Formatting a Document with Token Usage Information
./EnglishFormatter --token-usage --model gpt-3.5-turbo --output _formatted
- Select: Format document
- Enter: sample.txt
Expected Output:
Token Usage for sample.txt:
Prompt Tokens: 50
Completion Tokens: 200
Total Tokens: 250
sample.txt has been Format document.
Summarizing a Document
./EnglishFormatter --token-usage
- Select: Summarize document
- Enter: report.txt
Expected Output:
Token Usage for report.txt:
Prompt Tokens: 100
Completion Tokens: 30
Total Tokens: 130
report.txt has been Summarize document.
Behind the Scenes: Code Modifications
Adding the Command-Line Flag
In cli.cpp, I introduced a new boolean variable, showTokenUsage, and updated the argument parsing loop:
bool showTokenUsage = false;  // off unless the user asks for it

// Inside the argument parsing loop:
if (arg == "--token-usage" || arg == "-t") {
    showTokenUsage = true;
}
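To show how that fragment might sit inside a complete loop, here is a self-contained sketch that handles the three value-carrying and boolean options from the help text. The Options struct and parse_args function are hypothetical and are not the project's actual cli.cpp. The defaults are the ones listed under Configuration, and -h and -v are left out for brevity.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct Options {
    bool showTokenUsage = false;
    std::string model = "llama3-8b-8192";    // default from the Configuration section
    std::string outputSuffix = "_modified";  // default from the Configuration section
};

// args holds the command-line arguments without the program name (argv[0]).
Options parse_args(const std::vector<std::string>& args) {
    Options opts;
    // Fetches the value that follows a flag, or throws if the flag came last.
    auto value_of = [&args](std::size_t& i) -> const std::string& {
        if (i + 1 >= args.size()) {
            throw std::invalid_argument(args[i] + " requires a value");
        }
        return args[++i];
    };
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& arg = args[i];
        if (arg == "--token-usage" || arg == "-t") {
            opts.showTokenUsage = true;
        } else if (arg == "--model" || arg == "-m") {
            opts.model = value_of(i);
        } else if (arg == "--output" || arg == "-o") {
            opts.outputSuffix = value_of(i);
        } else {
            throw std::invalid_argument("unknown option: " + arg);
        }
    }
    return opts;
}
```

Collecting the results in one struct keeps the flag easy to hand to other classes later, instead of threading several loose variables through the code.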
Parsing Token Usage from the API Response
In eng_format.cpp, I updated the parse_response method to extract token usage:
struct TokenUsage {
    int prompt_tokens = 0;
    int completion_tokens = 0;
    int total_tokens = 0;
};
std::string eng_format::parse_response(const std::string& response, TokenUsage& tokenUsage) {
    json jsonResponse = json::parse(response);
    // The "usage" object may be absent; the counters then keep their zero defaults.
    if (jsonResponse.contains("usage")) {
        const json& usage = jsonResponse["usage"];
        tokenUsage.prompt_tokens = usage.value("prompt_tokens", 0);
        tokenUsage.completion_tokens = usage.value("completion_tokens", 0);
        tokenUsage.total_tokens = usage.value("total_tokens", 0);
    }
    // Extract the assistant's reply...
}
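Because missing fields silently fall back to 0, a quick sanity check on the parsed counters can help while debugging. The sketch below is a hypothetical helper, not part of the project. It assumes the provider reports a total equal to prompt plus completion tokens, which holds for the sample outputs earlier in this post but may not hold for every API.

```cpp
struct TokenUsage {
    int prompt_tokens = 0;
    int completion_tokens = 0;
    int total_tokens = 0;
};

// True when the response actually carried usage data (all-zero means it was absent).
bool has_usage(const TokenUsage& u) {
    return u.prompt_tokens > 0 || u.completion_tokens > 0 || u.total_tokens > 0;
}

// True when the counters agree with each other (assumes total = prompt + completion).
bool is_consistent(const TokenUsage& u) {
    return u.prompt_tokens >= 0 && u.completion_tokens >= 0 &&
           u.total_tokens == u.prompt_tokens + u.completion_tokens;
}
```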
Displaying Token Usage Information
In convert_file, I added a condition to output token usage:
if (showTokenUsage) {
    std::cerr << "Token Usage for " << filename << ":\n";
    std::cerr << "  Prompt Tokens: " << tokenUsage.prompt_tokens << "\n";
    std::cerr << "  Completion Tokens: " << tokenUsage.completion_tokens << "\n";
    std::cerr << "  Total Tokens: " << tokenUsage.total_tokens << "\n";
}
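One variation worth considering is to write the report to a stream passed in by the caller, which makes the formatting easy to test. This is a sketch under my own assumptions: print_token_usage and token_usage_report are hypothetical helpers, not the project's code.

```cpp
#include <ostream>
#include <sstream>
#include <string>

struct TokenUsage {
    int prompt_tokens = 0;
    int completion_tokens = 0;
    int total_tokens = 0;
};

// Writes the report to any stream: std::cerr, a log file, or a test buffer.
void print_token_usage(std::ostream& out, const std::string& filename,
                       const TokenUsage& usage) {
    out << "Token Usage for " << filename << ":\n"
        << "  Prompt Tokens: " << usage.prompt_tokens << "\n"
        << "  Completion Tokens: " << usage.completion_tokens << "\n"
        << "  Total Tokens: " << usage.total_tokens << "\n";
}

// Convenience wrapper that returns the report as a string.
std::string token_usage_report(const std::string& filename, const TokenUsage& usage) {
    std::ostringstream buffer;
    print_token_usage(buffer, filename, usage);
    return buffer.str();
}
```

Passing std::cerr as the stream reproduces the behaviour above. Writing to standard error keeps the report separate from anything the tool prints on standard output.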
Passing the Flag Through Classes
I updated the constructors and methods in both the display and eng_format classes to accept the showTokenUsage flag.
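The general pattern is plain constructor injection: the flag is parsed once and handed down to whichever object needs it. The classes below are illustrative stand-ins with hypothetical names, not the real display and eng_format classes. The real constructors take other arguments, and whether display owns the formatter is my assumption.

```cpp
// Stand-in for the class that talks to the API and prints the usage report.
class EngFormatSketch {
public:
    explicit EngFormatSketch(bool showTokenUsage) : showTokenUsage_(showTokenUsage) {}
    bool shows_token_usage() const { return showTokenUsage_; }

private:
    bool showTokenUsage_;
};

// Stand-in for the menu class; it forwards the flag to the formatter it owns.
class DisplaySketch {
public:
    explicit DisplaySketch(bool showTokenUsage) : formatter_(showTokenUsage) {}
    const EngFormatSketch& formatter() const { return formatter_; }

private:
    EngFormatSketch formatter_;
};
```

Passing the flag explicitly avoids a global variable and makes it obvious from each constructor which classes depend on it.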
Challenges and Learnings
- Understanding Existing Code: It was crucial to grasp the original code structure to make seamless additions.
- API Response Handling: Ensuring that the token usage information was correctly extracted required careful parsing.
- Maintaining Code Style: Adhering to the original coding style was essential for consistency.
- Collaboration: Communicating with the project owner helped refine the feature and fix issues.
Conclusion
Contributing to EnglishFormatter was an enriching experience that enhanced my understanding of open-source collaboration and LLM integrations. The addition of token usage reporting makes the tool more transparent and helpful for users mindful of API costs and token limitations.
GitHub Repository: EnglishFormatter
Video Demo: EnglishFormatter Demo (Please watch the demo to see the tool in action.)
Future Work
- Support for Multiple LLM Providers: Extend compatibility with other APIs.
- Enhanced Error Handling: Improve feedback for network issues or invalid inputs.
- Additional Features: Implement text translation or grammar checking.
Call to Action
If you're interested in improving text processing tools or learning more about LLMs, feel free to contribute to the EnglishFormatter project. Whether it's adding new features, fixing bugs, or enhancing documentation, your contributions are welcome!
Contribute Here: EnglishFormatter on GitHub
Thank you for reading! If you have any questions or suggestions, please leave a comment below or open an issue on GitHub.