Introduction
Today, apps built on Large Language Models (LLMs) are growing fast. People rely on LLMs to solve tough problems in many areas, such as education, finance, and healthcare, and developers worldwide are building new LLM-powered apps that are changing how we live, work, and communicate.
Counting tokens before sending a prompt to the LLM is important for two reasons. First, it helps users manage their budget: knowing how many tokens a prompt uses prevents surprise costs. Second, it keeps the model working well: the total tokens in a prompt must stay below the model's context limit. If the prompt exceeds that limit, the model might perform poorly, or the request might fail outright.
Tokenizer in Backend vs Frontend
Counting prompt tokens is therefore a crucial task in these applications, and there are essentially two ways to accomplish it.
Backend Implementation
The first, and most common, solution is to run a tokenizer in the application's backend and expose an API for the frontend to call when needed. This method is generally straightforward to implement, especially given Python libraries like tiktoken and tokenizers that are designed specifically for this purpose and are incredibly user-friendly.
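For illustration, here is a minimal sketch of such an endpoint. It assumes a Node.js server using Express and reuses the @xenova/transformers tokenizer introduced later in this post; the route name and port are arbitrary choices, and a Python backend would do the same with tiktoken or tokenizers.

import express from "express";
import { AutoTokenizer } from "@xenova/transformers";

const app = express();

// Load the tokenizer once at startup and reuse it across requests.
const tokenizer = await AutoTokenizer.from_pretrained("Xenova/gpt-4o");

// POST /token-count with { "text": "..." } responds with { "count": <number> }.
app.post("/token-count", express.json(), (req, res) => {
  const tokens = tokenizer.encode(req.body.text ?? "");
  res.json({ count: tokens.length });
});

app.listen(3000);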
However, there are some drawbacks. First, it is inefficient: the frontend must send large volumes of text to the backend just to receive a single number, which is especially wasteful for very long text. Second, it spends server CPU on token counting, work that contributes little to the product's value. Finally, the network round trip adds noticeable latency while a user is typing and waiting for the token count, leading to a poor user experience.
Frontend Implementation
Thanks to transformers.js, we can run the tokenizer and model locally in the browser. Transformers.js is designed to be functionally equivalent to Hugging Face's transformers Python library, meaning you can run the same pretrained models using a very similar API.
Installation
To install via NPM, run:
npm i @xenova/transformers
To run transformers.js on the client side of Next.js, you need to update the next.config.js file:
/** @type {import('next').NextConfig} */
const nextConfig = {
  // (Optional) Export as a static site
  // See https://nextjs.org/docs/pages/building-your-application/deploying/static-exports#configuration
  output: 'export', // Feel free to modify/remove this option

  // Override the default webpack configuration
  webpack: (config) => {
    // See https://webpack.js.org/configuration/resolve/#resolvealias
    config.resolve.alias = {
      ...config.resolve.alias,
      "sharp$": false,
      "onnxruntime-node$": false,
    }
    return config;
  },
}

module.exports = nextConfig
Code Sample
- Firstly, you need to import AutoTokenizer from @xenova/transformers:
import { AutoTokenizer } from "@xenova/transformers";
- You can create a tokenizer using the AutoTokenizer.from_pretrained function, which requires the pretrained_model_name_or_path parameter. Xenova provides tokenizers for widely-used Large Language Models (LLMs) like GPT-4, Claude-3, and Llama-3. To access these, visit the Hugging Face website, a hub for machine learning resources, at huggingface.co/Xenova. The tokenizer configuration for the latest GPT-4o model is available at Xenova/gpt-4o. You can create a tokenizer for GPT-4o now:
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/gpt-4o');
- The usage of the tokenizer is very similar to the tokenizers library in Python. The tokenizer.encode method converts text into tokens, and counting them is as simple as taking the length of the result (see the sketch after the example below).
const tokens = tokenizer.encode('hello world'); // [24912, 2375]
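Since our goal is a token count rather than the tokens themselves, the length of the returned array is the number we care about. Here is a minimal sketch; the context limit below is an assumed example, so check your model's actual limit:

const prompt = 'Explain tokenization in one paragraph.';
const numTokens = tokenizer.encode(prompt).length;

// Warn before sending a prompt that exceeds the model's context window.
const CONTEXT_LIMIT = 128000; // assumed value; varies by model
if (numTokens > CONTEXT_LIMIT) {
  console.warn(`Prompt uses ${numTokens} tokens, exceeding the ${CONTEXT_LIMIT}-token limit.`);
}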
As you can see, the tokenizer of transformers.js is extremely easy to use, and it calculates tokens quickly enough to run entirely in the browser while the user types.
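To avoid redoing the work on every single keystroke, you can debounce the calculation. Here is a minimal sketch in plain JavaScript that reuses the tokenizer created above; the element IDs and the 200 ms delay are hypothetical choices, not part of transformers.js:

const input = document.getElementById('prompt'); // hypothetical textarea
const label = document.getElementById('count');  // hypothetical counter label
let timer;

input.addEventListener('input', () => {
  clearTimeout(timer);
  // Recount only after the user pauses typing for 200 ms.
  timer = setTimeout(() => {
    label.textContent = `${tokenizer.encode(input.value).length} tokens`;
  }, 200);
});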
Demo
Using this pure-browser technique, I created an all-in-one website that provides token counters for all popular models.
You can test the GPT-4o tokenizer there. Here is a screenshot of the page.