Counting Tokens: Sorting Through the Details

The women of the Six Triple Eight faced a monumental challenge: deciphering incomplete addresses, nicknames, and smudged handwriting under strict time constraints. Similarly, when fine-tuning an OpenAI model on custom data, understanding token usage is crucial, not only to ensure the model can handle complex tasks but also to manage costs effectively.

Using tiktoken, we can calculate the token count of our text data to stay within OpenAI's token limits and optimize efficiency. Fine-tuning a model isn't just a technical challenge; it comes with financial implications. OpenAI's pricing, for instance, lists fine-tuning GPT-3.5 Turbo at $0.008 per 1,000 tokens. To put that into perspective, 1,000 tokens roughly equate to 750 words.
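As a quick sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes the $0.008 per 1,000-token rate and the 1,000-tokens-per-750-words rule of thumb quoted above; `estimate_cost_from_words` is a hypothetical helper, and OpenAI's pricing page is the source of truth for current rates.

```python
# Back-of-the-envelope cost estimate from a word count.
# Assumes ~1,000 tokens per 750 words and $0.008 per 1K training tokens,
# as quoted above; check OpenAI's pricing page for current rates.
def estimate_cost_from_words(word_count: int,
                             price_per_1k_tokens: float = 0.008) -> float:
    approx_tokens = word_count * (1000 / 750)  # ~1.33 tokens per word
    return (approx_tokens / 1000) * price_per_1k_tokens

# Example: a 75,000-word dataset is roughly 100,000 tokens,
# which works out to about $0.80 for a single training pass.
print(f"${estimate_cost_from_words(75_000):.2f}")  # -> $0.80
```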

In short, fine-tuning can be expensive, with costs scaling directly with token usage. Planning and budgeting ahead—just as the Six Triple Eight meticulously sorted through their backlog—are key to success.

Code

```python
import tiktoken

def cal_num_tokens_from_row(string: str, model_name: str) -> int:
    """Count the tokens in a single string using the given model's tokenizer."""
    # encoding_for_model expects a model name such as 'gpt-3.5-turbo',
    # not the name of an encoding like 'cl100k_base'.
    encoding = tiktoken.encoding_for_model(model_name)
    return len(encoding.encode(string))

def cal_num_tokens_from_df(df, model_name: str) -> int:
    """Sum the token counts over the 'text' column of a DataFrame."""
    total_tokens = 0
    for text in df['text']:
        total_tokens += cal_num_tokens_from_row(text, model_name)
    return total_tokens

# df is assumed to be a pandas DataFrame with a 'text' column.
total_tokens = cal_num_tokens_from_df(df, 'gpt-3.5-turbo')
print(f"total {total_tokens}")
```

Based on the total token count, fine-tuning could cost around $8–$9: at $0.008 per 1,000 tokens, that corresponds to a dataset on the order of a million tokens, which might be prohibitively expensive for an individual. Planning and budgeting are essential to manage these costs effectively.
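To make that estimate concrete, here is a small sketch that turns the `total_tokens` value computed above into a dollar figure. It again assumes the $0.008 per 1,000-token rate; `PRICE_PER_1K_TOKENS` is an illustrative constant, not an official API value.

```python
# Convert the total token count into an estimated fine-tuning cost,
# assuming the $0.008 per 1K-token rate quoted above.
PRICE_PER_1K_TOKENS = 0.008  # illustrative; check OpenAI's pricing page

estimated_cost = (total_tokens / 1000) * PRICE_PER_1K_TOKENS
print(f"estimated fine-tuning cost: ${estimated_cost:.2f}")
# e.g. a ~1,000,000-token dataset -> about $8.00 for one training pass
```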
