DEV Community

Ramu Narasinga

Posted on • Originally published at thinkthroo.com

Here's how OpenAI Token count is computed in Tiktokenizer - Part 1

In this article, we will review how OpenAI token count is computed in tiktokenizer. We will look at:

  1. TokenViewer component

  2. tokenizer

  3. createTokenizer function

TokenViewer component

In tiktokenizer/src/pages/index.tsx, at line 86, you will find the following code:

<section className="flex flex-col gap-4">
  <TokenViewer model={model} data={tokens} isFetching={false} />
</section>

TokenViewer is imported as shown below:

import { TokenViewer } from "~/sections/TokenViewer";

This component takes three props:

  1. model

  2. data

  3. isFetching

So what does this TokenViewer component look like?
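Based on the call site above, the props can be sketched as a TypeScript type. This is a hypothetical reconstruction, not the project's actual definition: the shape of `data` is an assumption, and `summarize` is a made-up helper that only illustrates how the three props interact.

```typescript
// Hypothetical props shape for TokenViewer, inferred from the call site.
// The `data` shape is an assumption; the real tokenize() result may differ.
type TokenViewerProps = {
  model?: string;            // currently selected model name
  data?: { count: number };  // tokenization result (shape assumed)
  isFetching: boolean;       // true while the tokenizer is still loading
};

// Made-up helper showing how such props might be summarized for display.
function summarize(props: TokenViewerProps): string {
  if (props.isFetching) return "loading";
  return `${props.model ?? "unknown"}: ${props.data?.count ?? 0} tokens`;
}

console.log(summarize({ model: "gpt-4o", data: { count: 3 }, isFetching: false }));
```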

tokenizer

At line 43 in tiktokenizer/src/pages/index.tsx, you will find the following code:

const tokenizer = useQuery({
  queryKey: [model],
  queryFn: ({ queryKey: [model] }) => createTokenizer(model!),
});

const tokens = tokenizer.data?.tokenize(inputText);

useQuery is imported as shown below:

import { useQuery } from "@tanstack/react-query";

The createTokenizer function is imported as shown below:

import { createTokenizer } from "~/models/tokenizer";
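The queryKey: [model] line means react-query caches one tokenizer per model name and only re-runs createTokenizer when the selected model changes. A minimal sketch of that caching idea, without react-query (fakeCreateTokenizer and getTokenizer are stand-in names, not the project's code):

```typescript
// Sketch of per-model caching, mimicking what queryKey: [model] provides.
type Tokenizer = { tokenize: (text: string) => string[] };

const cache = new Map<string, Promise<Tokenizer>>();

// Stand-in for the real createTokenizer, which loads tiktoken or an
// open-source tokenizer; here we just split on whitespace.
async function fakeCreateTokenizer(name: string): Promise<Tokenizer> {
  return { tokenize: (text) => text.split(/\s+/).filter(Boolean) };
}

function getTokenizer(model: string): Promise<Tokenizer> {
  // Reuse the in-flight or resolved promise for a previously seen model.
  if (!cache.has(model)) cache.set(model, fakeCreateTokenizer(model));
  return cache.get(model)!;
}
```

Once the promise resolves, tokenize(inputText) can run synchronously on every render, which is why the component simply reads tokenizer.data?.tokenize(inputText).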

createTokenizer function

In tiktokenizer/src/models/tokenizer.ts, you will find the following code at line 122:

export async function createTokenizer(name: string): Promise<Tokenizer> {
  console.log("createTokenizer", name);
  const oaiEncoding = oaiEncodings.safeParse(name);
  if (oaiEncoding.success) {
    console.log("oaiEncoding", oaiEncoding.data);
    return new TiktokenTokenizer(oaiEncoding.data);
  }
  const oaiModel = oaiModels.safeParse(name);
  if (oaiModel.success) {
    console.log("oaiModel", oaiModel.data);
    return new TiktokenTokenizer(oaiModel.data);
  }

  const ossModel = openSourceModels.safeParse(name);
  if (ossModel.success) {
    console.log("loading tokenizer", ossModel.data);
    const tokenizer = await OpenSourceTokenizer.load(ossModel.data);
    console.log("loaded tokenizer", name);
    return new OpenSourceTokenizer(tokenizer, name);
  }
  throw new Error("Invalid model or encoding");
}

oaiEncodings, oaiModels, and openSourceModels are imported as shown below:

import { oaiEncodings, oaiModels, openSourceModels } from ".";

So this function returns either:

  1. TiktokenTokenizer

  2. OpenSourceTokenizer
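The fall-through branching above can be sketched without zod. In the real code, oaiEncodings, oaiModels, and openSourceModels are zod schemas whose safeParse returns a { success, data } result; here, plain allow-lists (with example entries I chose, not the project's actual lists) stand in for the schemas, and classify returns a label instead of constructing a tokenizer:

```typescript
// Stand-in for zod's safeParse: succeeds only for names in the allow-list.
type ParseResult = { success: true; data: string } | { success: false };

const schemaFor = (allowed: string[]) => ({
  safeParse: (name: string): ParseResult =>
    allowed.includes(name) ? { success: true, data: name } : { success: false },
});

// Example entries only — the real lists live in tiktokenizer/src/models.
const oaiEncodings = schemaFor(["cl100k_base", "o200k_base"]);
const oaiModels = schemaFor(["gpt-4", "gpt-3.5-turbo"]);
const openSourceModels = schemaFor(["meta-llama/Llama-2-7b"]);

// Mirrors createTokenizer's first-match-wins branching.
function classify(name: string): string {
  if (oaiEncodings.safeParse(name).success) return "TiktokenTokenizer (encoding)";
  if (oaiModels.safeParse(name).success) return "TiktokenTokenizer (model)";
  if (openSourceModels.safeParse(name).success) return "OpenSourceTokenizer";
  throw new Error("Invalid model or encoding");
}
```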

You will find more information about oaiEncodings, oaiModels, and openSourceModels in tiktokenizer/src/models (they are imported from ".", the directory's index module).

You will learn more about TiktokenTokenizer and OpenSourceTokenizer in the next article.

About me:

Hey, my name is Ramu Narasinga. I study codebase architecture in large open-source projects.

Email: ramu.narasinga@gmail.com

Want to learn from open-source code? Solve challenges inspired by open-source projects.

References:

  1. https://github.com/dqbd/tiktokenizer/blob/master/src/pages/index.tsx#L86

  2. https://github.com/dqbd/tiktokenizer/blob/master/src/pages/index.tsx#L48

  3. https://github.com/dqbd/tiktokenizer/blob/master/src/pages/index.tsx#L43

  4. https://github.com/dqbd/tiktokenizer/blob/master/src/models/tokenizer.ts#L122

  5. https://tiktokenizer.vercel.app/
