nanasi

Posted on May 29

Building a Tokenizer 9.5x Faster than SentencePiece Unigram in Pure Rust 🦀

#rust #ai #llm #programming

Tokenization is one of those silent bottlenecks in the Large Language Model (LLM) world. While GPUs do the heavy lifting of running the model, the CPU is responsible for splitting raw text into token IDs.

In particular, the Unigram tokenization algorithm (popularized by Google's SentencePiece and used in models like T5 and Mistral) is notoriously CPU-intensive. During inference, it builds a lattice (a graph) of valid vocabulary matches and runs the Viterbi algorithm (a dynamic programming search) to find the path that maximizes joint probability.

While mathematically elegant, this graph search and per-input working memory overhead limits throughput.

What if we could move that complexity from runtime to distillation time?

Meet Tenya, an experimental Compiled Distilled Unigram (CDU) tokenizer written in Rust that tokenizes text at 23 Million tokens per second—achieving a 9.5x speedup over SentencePiece's core engine, with zero heap allocations in the hot path.

Here is how it works, the code behind it, and the performance trade-offs.

🛠️ The Bottleneck: Probabilistic Dynamic Programming

Under standard SentencePiece Unigram, tokenizing a string requires evaluating multiple matching subwords. For a word like tokenization, the dictionary might contain token, to, ken, iz, and ation.

To choose the best split, standard Unigram:

Builds a graph of possible cuts based on vocabulary overlaps.
Assigns log-probability scores to each edge.
Runs a Viterbi path-search to find the path with the highest joint likelihood.

Even with memory-optimized lattices, doing a graph-search dynamically for every single text string creates overhead.

🧠 The Tenya Solution: Compiled Distilled Unigrams (CDU)

Tenya shifts this paradigm. It asks: What if we did the dynamic programming once during training/export, and used a simple, deterministic search during inference?

The architecture works in three phases:

Train: Train a SentencePiece Unigram model (the \"Teacher\") on a corpus.
Distill: Tokenize the corpus with the teacher to count how often each vocabulary piece is actually selected. We then calculate a static priority score for each token: $$\text{Priority} = \text{Teacher Log-Probability} + \alpha \cdot \ln(\text{Corpus Frequency} + 1)$$
Compile: Export this vocabulary to a JSON file, which Tenya loads and flattens into a contiguous, cache-friendly prefix Trie in RAM.

At runtime, Tenya does a single forward pass over the bytes of the text, matching character sequences greedily using this flat Trie and resolving conflicts using the pre-computed priorities.

🦀 Inside the Rust Core: Zero Allocation lookups

To make Tenya run as fast as possible, we implemented it in Rust with zero heap allocations in the hot path.

Instead of using pointer-based nodes allocated all over the heap, the prefix tree is compiled into a flat vector of contiguous memory nodes:

pub struct TrieNode {
    pub token_id: Option<u32>,
    pub priority: f64,
    pub children: Vec<(u8, u32)>, // (byte, next_node_index)
}

Since the children of each node are sorted by their byte values, transitioning to the next node is just a cache-friendly binary search on a contiguous memory block.

During tokenization, the main loop traverses this Trie byte-by-byte, resolving matches and pushing token IDs into a pre-allocated vector:

pub fn encode_bytes_into(&self, bytes: &[u8], output: &mut Vec<u32>) {
    let mut offset = 0;
    let len = bytes.len();

    while offset < len {
        let mut best_match: Option<(u32, usize, f64)> = None;

        self.trie.common_prefix_search(&bytes[offset..], |token_id, priority, match_len| {
            // Pick the match with the highest distilled priority score
            let update = match best_match {
                None => true,
                Some((_, best_len, best_priority)) => {
                    priority > best_priority || (priority == best_priority && match_len > best_len)
                }
            };
            if update {
                best_match = Some((token_id, match_len, priority));
            }
        });

        if let Some((token_id, match_len, _)) = best_match {
            output.push(token_id);
            offset += match_len;
        } else {
            // Byte fallback for unknown bytes
            let byte = bytes[offset];
            output.push(self.fallback.lookup(byte));
            offset += 1;
        }
    }
}

🔄 Guaranteed 100% Reversibility

Many greedy tokenizers lose spaces or fail on emojis. Tenya guarantees 100% round-trip reversibility by reserving the final 256 IDs in its vocabulary for explicit byte fallbacks. If a character sequence is not in the Trie, it falls back to the exact bytes of the characters. When decoding, these bytes are reconstructed directly back into valid UTF-8.

🏁 Empirical Benchmarks

We benchmarked Tenya against Google's SWIG-wrapped SentencePiece Unigram C++ library on a validation corpus consisting of mixed prose, numbers, and code:

Metric	SentencePiece Unigram (Teacher)	Tenya (LongestMatch)	Tenya (PriorityMatch)
p50 Latency (ms)	0.415 ms	0.047 ms (8.8x faster)	0.124 ms (3.3x faster)
p95 Latency (ms)	0.890 ms	0.056 ms	0.149 ms
Throughput (MB/s)	3.67 MB/s	32.21 MB/s	12.25 MB/s
Throughput (Tokens/sec)	1.67 M tokens/s	23.30 M tokens/s	11.28 M tokens/s
Teacher F1 Agreement	100.00% (Baseline)	38.57%	43.80%
Reversibility Rate	100.00%	100.00%	100.00%

Why is there a speed difference between strategies?

LongestMatch is the fastest (23.30M tokens/s). It takes large steps through the text, matching long words, which means it rarely restarts its search from the root of the Trie.
PriorityMatch is slower (11.28M tokens/s) but matches the teacher's style much closer (43.80% F1 alignment). Because it prefers high-priority short sub-words (like common syllables), it has to restart its search from the root node more frequently.

⚖️ The Trade-offs

Is this a free lunch? No.

Vocabulary Fragmentation: Because Tenya uses a greedy prefix matcher instead of dynamic programming path optimization, it tends to split words into slightly smaller pieces. Tenya generates 1.5x to 1.9x more tokens than the teacher, meaning the text representations are longer.
Not a Drop-in Replacement: You cannot plug Tenya directly into an already-trained LLaMA model because the word boundaries will mismatch, resulting in garbage outputs.

Where Tenya excels is when you are training a new model from scratch, building high-throughput dataset loading pipelines, or running local inference on highly resource-constrained IoT/edge hardware.

🚀 Try it yourself!

The repository is open-source and includes the entire Python training pipeline, the Rust core runtime, the command-line CLI, and Criterion benchmark targets.

Get started by cloning the repo:

git clone https://github.com/spellsaif/tenya.git
cd tenya

Run the pipeline to train, distill, and export:

# Set up venv
python3 -m venv .venv
source .venv/bin/activate
pip install sentencepiece

# Train teacher & distill policy
python3 python/train_teacher.py --input data/sample.txt --vocab-size 500 --model-prefix teacher
python3 python/distill_policy.py --input data/sample.txt --model teacher.model --output policy.json
python3 python/export_tenya_vocab.py --model teacher.model --policy policy.json --output vocab.json

Tokenize text using the Rust CLI:

cargo run --release -p tenya-cli -- --vocab vocab.json encode --text \"hello world\"

Check out the full repository here:
👉 github.com/spellsaif/tenya

Let me know what you think of this architecture in the comments!

Top comments (1)

Harjot Singh • May 31

A 9.5x speedup on Unigram tokenization in Rust is the kind of invisible-infrastructure win that matters more than it sounds, because tokenization sits on the hot path of literally every call, training and inference, and nobody profiles it until it's the bottleneck. It's the classic boring-but-load-bearing layer: not glamorous, but you pay for it on every single request, so a near-10x cut compounds across an entire pipeline. The pure-Rust angle is the right instinct too, this is exactly the kind of deterministic, CPU-bound, no-reasoning-needed work that should be fast native code rather than living in a slow Python loop, and getting it out of the interpreter is often where the order-of-magnitude hides. The detail I'd be curious about is correctness parity, the only scary part of a faster tokenizer is silent divergence from the reference, a tokenization that's 9.5x faster but disagrees with SentencePiece on even a small fraction of inputs corrupts everything downstream invisibly, so the byte-exact match against the reference is what makes the speed actually usable. Fast plus provably-identical-output is the combo. That make-the-boring-hot-path-fast-without-changing-its-behavior instinct is core to how I think about systems in Moonshift. How are you verifying output parity with SentencePiece, exhaustive diff on a corpus, or property-based testing on the merges?