Counting Tokens: Sorting Through the Details

The women of the Six Triple Eight faced a monumental challenge: deciphering incomplete addresses, nicknames, and smudged handwriting under strict time constraints. Similarly, when fine-tuning an OpenAI model on custom data, understanding token usage is crucial, not only to ensure the model can handle complex tasks but also to manage costs effectively.

Using tiktoken, we can calculate the token count of our text data to stay within OpenAI's token limits and optimize efficiency. Fine-tuning a model isn't just a technical challenge; it comes with financial implications. OpenAI's pricing, for instance, lists fine-tuning GPT-3.5 Turbo at $0.008 per 1,000 tokens. To put that into perspective, 1,000 tokens roughly equate to 750 words.
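As a quick sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes the $0.008 per 1,000-token rate and the 1,000-tokens-per-750-words rule of thumb quoted above; `estimate_cost_from_words` is a hypothetical helper, and OpenAI's pricing page is the source of truth for current rates.

```python
# Back-of-the-envelope cost estimate from a word count.
# Assumes ~1,000 tokens per 750 words and $0.008 per 1K training tokens,
# as quoted above; check OpenAI's pricing page for current rates.
def estimate_cost_from_words(word_count: int,
                             price_per_1k_tokens: float = 0.008) -> float:
    approx_tokens = word_count * (1000 / 750)  # ~1.33 tokens per word
    return (approx_tokens / 1000) * price_per_1k_tokens

# Example: a 75,000-word dataset is roughly 100,000 tokens,
# which works out to about $0.80 for a single training pass.
print(f"${estimate_cost_from_words(75_000):.2f}")  # -> $0.80
```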

In short, fine-tuning can be expensive, with costs scaling directly with token usage. Planning and budgeting ahead—just as the Six Triple Eight meticulously sorted through their backlog—are key to success.

Code

```python
import tiktoken

def cal_num_tokens_from_row(string: str, model_name: str) -> int:
    """Count the tokens in a single string using the given model's tokenizer."""
    # encoding_for_model expects a model name such as 'gpt-3.5-turbo',
    # not the name of an encoding like 'cl100k_base'.
    encoding = tiktoken.encoding_for_model(model_name)
    return len(encoding.encode(string))

def cal_num_tokens_from_df(df, model_name: str) -> int:
    """Sum the token counts over the 'text' column of a DataFrame."""
    total_tokens = 0
    for text in df['text']:
        total_tokens += cal_num_tokens_from_row(text, model_name)
    return total_tokens

# df is assumed to be a pandas DataFrame with a 'text' column.
total_tokens = cal_num_tokens_from_df(df, 'gpt-3.5-turbo')
print(f"total {total_tokens}")
```

Based on the total token count, fine-tuning could cost around $8–$9: at $0.008 per 1,000 tokens, that corresponds to a dataset on the order of a million tokens, which might be prohibitively expensive for an individual. Planning and budgeting are essential to manage these costs effectively.
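To make that estimate concrete, here is a small sketch that turns the `total_tokens` value computed above into a dollar figure. It again assumes the $0.008 per 1,000-token rate; `PRICE_PER_1K_TOKENS` is an illustrative constant, not an official API value.

```python
# Convert the total token count into an estimated fine-tuning cost,
# assuming the $0.008 per 1K-token rate quoted above.
PRICE_PER_1K_TOKENS = 0.008  # illustrative; check OpenAI's pricing page

estimated_cost = (total_tokens / 1000) * PRICE_PER_1K_TOKENS
print(f"estimated fine-tuning cost: ${estimated_cost:.2f}")
# e.g. a ~1,000,000-token dataset -> about $8.00 for one training pass
```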
