Wayne

Posted on • Originally published at wheynelau.dev
Using hf tokenizers in Rust

The tokenizers library from Hugging Face provides an efficient way to work with text tokenization in Rust. This guide shows you how to get started with pretrained tokenizers.

Setup

First, add the tokenizers library to your project, enabling the features needed to download pretrained tokenizers from the Hugging Face Hub:

cargo add tokenizers --features http,hf-hub
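This adds an entry like the following to your Cargo.toml (the version number here is illustrative; cargo add will pick the latest release):

```toml
[dependencies]
tokenizers = { version = "0.21", features = ["http", "hf-hub"] }
```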

Basic Usage

Here's a complete example that loads a pretrained tokenizer and processes text:

use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // Load a pretrained tokenizer
    let tokenizer = Tokenizer::from_pretrained("hf-internal-testing/llama-tokenizer", None)?;

    let text = "This is a sample string to tokenize";

    // Encode the text (false = no special tokens)
    let encoding = tokenizer.encode(text, false)?;

    // Get token IDs
    let token_ids = encoding.get_ids();
    println!("Token IDs: {:?}", token_ids);

    // Get token text
    let tokens = encoding.get_tokens();
    println!("Tokens: {:?}", tokens);

    println!("Original: {}", text);
    println!("Number of tokens: {}", token_ids.len());

    // Decode back to text (true = skip special tokens)
    let decoded = tokenizer.decode(token_ids, true)?;
    println!("Decoded: {}", decoded);

    Ok(())
}

Working with Different Models

You can load the tokenizer of various pretrained models from the Hub:

// GPT-2 tokenizer
let gpt_tokenizer = Tokenizer::from_pretrained("gpt2", None)?;

// BERT tokenizer
let bert_tokenizer = Tokenizer::from_pretrained("bert-base-uncased", None)?;

// Llama tokenizer
let llama_tokenizer = Tokenizer::from_pretrained("hf-internal-testing/llama-tokenizer", None)?;

Configuration

To change the cache directory for downloaded models, set the HF_HOME environment variable:

export HF_HOME=/path/to/your/cache

Setting environment variables programmatically is not recommended: std::env::set_var is an unsafe function as of the Rust 2024 edition, because mutating the environment can race with other threads reading it.

Private Repositories

If you encounter this error:

Error: RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/google/gemma-3-12b-it/resolve/main/tokenizer.json]))

It means you are not authenticated and the repository requires a token. There are two ways to provide one:

  1. Write your token to $HF_HOME/token, where $HF_HOME defaults to $HOME/.cache/huggingface
  2. Within Rust code:
use tokenizers::{Tokenizer, FromPretrainedParameters};

let params = FromPretrainedParameters {
    token: Some("<your very secret token>".to_string()),
    ..Default::default()
};
let tokenizer = Tokenizer::from_pretrained("google/gemma-3-4b-it", Some(params))?;

Note that for gated or private models you may still need to request access to the repository on the Hugging Face Hub.

Branches

You can pin a particular branch, tag, or commit via the revision field:

use tokenizers::{Tokenizer, FromPretrainedParameters};

let params = FromPretrainedParameters {
    revision: "main".to_string(),  // or specific commit hash
    ..Default::default()
};
let tokenizer = Tokenizer::from_pretrained("google/gemma-3-4b-it", Some(params))?;

User-Agent

FromPretrainedParameters also exposes a user_agent field for customizing the user agent string sent with HTTP requests:

use tokenizers::{Tokenizer, FromPretrainedParameters};

let params = FromPretrainedParameters {
    user_agent: Some("my-rust-app/1.0".to_string()),
    ..Default::default()
};
let tokenizer = Tokenizer::from_pretrained("gpt2", Some(params))?;

Summary

The Hugging Face tokenizers library provides a robust, production-ready solution for text processing in Rust applications. With support for pretrained models, authentication for private repositories, and flexible configuration options, it's an excellent choice for NLP workflows in Rust.

You can find this post and more on my blog.
