Fahad Ali Khan

Posted on Sep 21, 2024

Enhancing EnglishFormatter: A Journey into Open Source Contribution

#beginners #programming #webdev #tutorial

Introduction

In software development, open-source contributions improve projects and give contributors invaluable learning experiences. In this blog post, I'll share my journey of contributing to the EnglishFormatter project, a command-line tool that formats, summarizes, and paraphrases text documents using large language models.
Watch some of the functionality here

Project Overview

EnglishFormatter is a C++ command-line application that sends text files to large language models (LLMs) for processing. It offers an interactive menu for users to select actions and supports customization through command-line flags.

Features:

Format Documents: Enhance the readability and structure of text.
Summarize Documents: Generate concise summaries.
Paraphrase Documents: Rephrase text while retaining the original meaning.
Customizable Models: Choose different language models via the --model flag.
Output Customization: Define custom suffixes for output files using the --output flag.
Token Usage Reporting: Display token usage information with the new --token-usage flag.

GitHub Repository: EnglishFormatter

The New Feature: Token Usage Reporting

Token usage matters when working with LLMs because models have a limited context length and API providers typically bill per token. I added a new command-line flag, --token-usage (or -t), that reports the number of tokens sent in the prompt, the number received in the completion, and the total.

Why Token Usage?

Cost Management: Helps users estimate the cost of API calls.
Debugging: Assists in optimizing prompts to stay within token limits.
Performance Tuning: Enables users to fine-tune inputs for better responses.

Getting Started with EnglishFormatter

Prerequisites

C++17 or higher: Required for compiling the code.
libcurl: For HTTP requests.
nlohmann/json: For JSON parsing.
dotenv-cpp: To manage environment variables.
An API Key: From the language model provider (e.g., OpenAI).

Installation

Clone the Repository

   git clone https://github.com/yourusername/EnglishFormatter.git
   cd EnglishFormatter

Install Dependencies

libcurl

Windows: Use the pre-built binaries.
Linux:

   sudo apt-get install libcurl4-openssl-dev

nlohmann/json

Download the json.hpp file from the official repository.

dotenv-cpp

 git clone https://github.com/motdotla/dotenv-cpp.git

Include dotenv.h and dotenv.cpp in your project.

Set Up Environment Variables

Create a .env file in the project root:

   API_KEY=your_api_key_here

Build the Project

   g++ -std=c++17 -o EnglishFormatter main.cpp eng_format.cpp display.cpp dotenv.cpp -lcurl -pthread

Configuration

API Key

Ensure your API key is set in the .env file or as an environment variable.
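
For the environment-variable route, reading the key could look roughly like this. It is a minimal sketch, assuming the variable is named API_KEY as in the .env example above. The read_env helper is hypothetical and not taken from the project, and loading the .env file itself is left to dotenv-cpp.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper: returns the value of an environment variable,
// or std::nullopt when the variable is unset or empty.
std::optional<std::string> read_env(const char* name) {
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        return std::nullopt;
    }
    return std::string(value);
}
```

Calling read_env("API_KEY") at startup and exiting with a clear message when it returns std::nullopt avoids sending a request that is bound to fail authentication.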

Default Settings
- Model: Default is llama3-8b-8192.
- Output Suffix: Default is _modified.
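
The post does not spell out how the suffix is applied to the output file name. One plausible reading is that it is inserted before the file extension, so sample.txt becomes sample_modified.txt. The sketch below illustrates that assumption only; with_suffix is a hypothetical function, not the project's code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: inserts `suffix` before the extension of the final
// path component, or appends it when that component has no extension.
std::string with_suffix(const std::string& path, const std::string& suffix) {
    const std::size_t slash = path.find_last_of("/\\");
    const std::size_t dot = path.find_last_of('.');
    // A dot only marks an extension if it lies inside the final path
    // component and is not that component's first character.
    const bool has_extension =
        dot != std::string::npos &&
        (slash == std::string::npos ? dot > 0 : dot > slash + 1);
    if (!has_extension) {
        return path + suffix;
    }
    return path.substr(0, dot) + suffix + path.substr(dot);
}
```

Under this assumption, with_suffix("sample.txt", "_modified") yields "sample_modified.txt".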

Using EnglishFormatter

Run the application from the command line:

./EnglishFormatter [options]

Command-Line Options

Options:
  -h, --help           Show this help message and exit
  -v, --version        Show the tool's version and exit
  -t, --token-usage    Show token usage information
  -m, --model MODEL    Specify the model to use
  -o, --output NAME    Specify the output file suffix

Interactive Menu

Without any options, the tool launches an interactive menu:

./EnglishFormatter

Menu Options:

Format document
Summarize document
Paraphrase document
Exit

Navigate using the Up and Down arrow keys and select with Enter.

Example Usage

Formatting a Document with Token Usage Information

./EnglishFormatter --token-usage --model gpt-3.5-turbo --output _formatted

Select: Format document
Enter: sample.txt

Expected Output:

Token Usage for sample.txt:
  Prompt Tokens:     50
  Completion Tokens: 200
  Total Tokens:      250
sample.txt has been Format document.

Summarizing a Document

./EnglishFormatter --token-usage

Select: Summarize document
Enter: report.txt

Expected Output:

Token Usage for report.txt:
  Prompt Tokens:     100
  Completion Tokens: 30
  Total Tokens:      130
report.txt has been Summarize document.

Behind the Scenes: Code Modifications

Adding the Command-Line Flag

In cli.cpp, I introduced a new boolean variable showTokenUsage and updated the argument parsing loop:

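bool showTokenUsage = false;

// Inside the argument-parsing loop; arg is assumed to be a std::string
// holding the current command-line argument.
if (arg == "--token-usage" || arg == "-t") {
    showTokenUsage = true;
}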
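
The fragment above shows only the new branch. For context, a self-contained parsing loop covering the value-taking flags from the options table might look like the following sketch. Options and parse_args are hypothetical names rather than the project's actual types, the defaults are the ones listed under Default Settings, and -h and -v are left out for brevity.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for parsed options; not the project's actual type.
struct Options {
    bool showTokenUsage = false;
    std::string model = "llama3-8b-8192";    // default model from the post
    std::string outputSuffix = "_modified";  // default suffix from the post
};

// Parses the arguments that follow the program name.
// Throws std::invalid_argument on an unknown flag or a missing value.
Options parse_args(const std::vector<std::string>& args) {
    Options opts;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& arg = args[i];
        if (arg == "--token-usage" || arg == "-t") {
            opts.showTokenUsage = true;
        } else if (arg == "--model" || arg == "-m") {
            if (i + 1 >= args.size()) {
                throw std::invalid_argument(arg + " requires a value");
            }
            opts.model = args[++i];  // consume the flag's value
        } else if (arg == "--output" || arg == "-o") {
            if (i + 1 >= args.size()) {
                throw std::invalid_argument(arg + " requires a value");
            }
            opts.outputSuffix = args[++i];  // consume the flag's value
        } else {
            throw std::invalid_argument("unknown option: " + arg);
        }
    }
    return opts;
}
```

From main, the arguments can be passed as parse_args(std::vector<std::string>(argv + 1, argv + argc)). Collecting the results in one struct keeps the flag easy to hand on to other classes.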

Parsing Token Usage from the API Response

In eng_format.cpp, I updated the parse_response method to extract token usage:

struct TokenUsage {
    int prompt_tokens = 0;
    int completion_tokens = 0;
    int total_tokens = 0;
};

std::string eng_format::parse_response(const std::string& response, TokenUsage& tokenUsage) {
    // json is assumed to be an alias for nlohmann::json; parse() throws on malformed input.
    json jsonResponse = json::parse(response);

    // The "usage" object may be absent, and any missing field falls back to 0.
    if (jsonResponse.contains("usage")) {
        const json& usage = jsonResponse["usage"];
        tokenUsage.prompt_tokens = usage.value("prompt_tokens", 0);
        tokenUsage.completion_tokens = usage.value("completion_tokens", 0);
        tokenUsage.total_tokens = usage.value("total_tokens", 0);
    }

    // Extract the assistant's reply...
}

Displaying Token Usage Information

In convert_file, I added a condition to output token usage. The report goes to std::cerr, which keeps it separate from anything written to standard output:

if (showTokenUsage) {
    std::cerr << "Token Usage for " << filename << ":\n";
    std::cerr << "  Prompt Tokens:     " << tokenUsage.prompt_tokens << "\n";
    std::cerr << "  Completion Tokens: " << tokenUsage.completion_tokens << "\n";
    std::cerr << "  Total Tokens:      " << tokenUsage.total_tokens << "\n";
}

Passing the Flag Through Classes

I updated the constructors and methods in both the display and eng_format classes to accept the showTokenUsage flag.

Challenges and Learnings

Understanding Existing Code: It was crucial to grasp the original code structure to make seamless additions.
API Response Handling: Ensuring that the token usage information was correctly extracted required careful parsing.
Maintaining Code Style: Adhering to the original coding style was essential for consistency.
Collaboration: Communicating with the project owner helped refine the feature and fix issues.

Conclusion

Contributing to EnglishFormatter was an enriching experience that enhanced my understanding of open-source collaboration and LLM integrations. The addition of token usage reporting makes the tool more transparent and helpful for users mindful of API costs and token limitations.