Tokenization is one of those silent bottlenecks in the Large Language Model (LLM) world. While GPUs do the heavy lifting of running the model, the CPU is responsible for splitting raw text into token IDs.
In particular, the Unigram tokenization algorithm (popularized by Google's SentencePiece and used in models like T5 and Mistral) is notoriously CPU-intensive. During inference, it builds a lattice (a graph) of valid vocabulary matches and runs the Viterbi algorithm (a dynamic programming search) to find the path that maximizes joint probability.
While mathematically elegant, this graph search and per-input working memory overhead limits throughput.
What if we could move that complexity from runtime to distillation time?
Meet Tenya, an experimental Compiled Distilled Unigram (CDU) tokenizer written in Rust that tokenizes text at 23 Million tokens per secondβachieving a 9.5x speedup over SentencePiece's core engine, with zero heap allocations in the hot path.
Here is how it works, the code behind it, and the performance trade-offs.
π οΈ The Bottleneck: Probabilistic Dynamic Programming
Under standard SentencePiece Unigram, tokenizing a string requires evaluating multiple matching subwords. For a word like tokenization, the dictionary might contain token, to, ken, iz, and ation.
To choose the best split, standard Unigram:
- Builds a graph of possible cuts based on vocabulary overlaps.
- Assigns log-probability scores to each edge.
- Runs a Viterbi path-search to find the path with the highest joint likelihood.
Even with memory-optimized lattices, doing a graph-search dynamically for every single text string creates overhead.
π§ The Tenya Solution: Compiled Distilled Unigrams (CDU)
Tenya shifts this paradigm. It asks: What if we did the dynamic programming once during training/export, and used a simple, deterministic search during inference?
The architecture works in three phases:
- Train: Train a SentencePiece Unigram model (the \"Teacher\") on a corpus.
- Distill: Tokenize the corpus with the teacher to count how often each vocabulary piece is actually selected. We then calculate a static priority score for each token: $$\text{Priority} = \text{Teacher Log-Probability} + \alpha \cdot \ln(\text{Corpus Frequency} + 1)$$
- Compile: Export this vocabulary to a JSON file, which Tenya loads and flattens into a contiguous, cache-friendly prefix Trie in RAM.
At runtime, Tenya does a single forward pass over the bytes of the text, matching character sequences greedily using this flat Trie and resolving conflicts using the pre-computed priorities.
π¦ Inside the Rust Core: Zero Allocation lookups
To make Tenya run as fast as possible, we implemented it in Rust with zero heap allocations in the hot path.
Instead of using pointer-based nodes allocated all over the heap, the prefix tree is compiled into a flat vector of contiguous memory nodes:
pub struct TrieNode {
pub token_id: Option<u32>,
pub priority: f64,
pub children: Vec<(u8, u32)>, // (byte, next_node_index)
}
Since the children of each node are sorted by their byte values, transitioning to the next node is just a cache-friendly binary search on a contiguous memory block.
During tokenization, the main loop traverses this Trie byte-by-byte, resolving matches and pushing token IDs into a pre-allocated vector:
pub fn encode_bytes_into(&self, bytes: &[u8], output: &mut Vec<u32>) {
let mut offset = 0;
let len = bytes.len();
while offset < len {
let mut best_match: Option<(u32, usize, f64)> = None;
self.trie.common_prefix_search(&bytes[offset..], |token_id, priority, match_len| {
// Pick the match with the highest distilled priority score
let update = match best_match {
None => true,
Some((_, best_len, best_priority)) => {
priority > best_priority || (priority == best_priority && match_len > best_len)
}
};
if update {
best_match = Some((token_id, match_len, priority));
}
});
if let Some((token_id, match_len, _)) = best_match {
output.push(token_id);
offset += match_len;
} else {
// Byte fallback for unknown bytes
let byte = bytes[offset];
output.push(self.fallback.lookup(byte));
offset += 1;
}
}
}
π Guaranteed 100% Reversibility
Many greedy tokenizers lose spaces or fail on emojis. Tenya guarantees 100% round-trip reversibility by reserving the final 256 IDs in its vocabulary for explicit byte fallbacks. If a character sequence is not in the Trie, it falls back to the exact bytes of the characters. When decoding, these bytes are reconstructed directly back into valid UTF-8.
π Empirical Benchmarks
We benchmarked Tenya against Google's SWIG-wrapped SentencePiece Unigram C++ library on a validation corpus consisting of mixed prose, numbers, and code:
| Metric | SentencePiece Unigram (Teacher) | Tenya (LongestMatch) | Tenya (PriorityMatch) |
|---|---|---|---|
| p50 Latency (ms) | 0.415 ms | 0.047 ms (8.8x faster) | 0.124 ms (3.3x faster) |
| p95 Latency (ms) | 0.890 ms | 0.056 ms | 0.149 ms |
| Throughput (MB/s) | 3.67 MB/s | 32.21 MB/s | 12.25 MB/s |
| Throughput (Tokens/sec) | 1.67 M tokens/s | 23.30 M tokens/s | 11.28 M tokens/s |
| Teacher F1 Agreement | 100.00% (Baseline) | 38.57% | 43.80% |
| Reversibility Rate | 100.00% | 100.00% | 100.00% |
Why is there a speed difference between strategies?
- LongestMatch is the fastest (23.30M tokens/s). It takes large steps through the text, matching long words, which means it rarely restarts its search from the root of the Trie.
- PriorityMatch is slower (11.28M tokens/s) but matches the teacher's style much closer (43.80% F1 alignment). Because it prefers high-priority short sub-words (like common syllables), it has to restart its search from the root node more frequently.
βοΈ The Trade-offs
Is this a free lunch? No.
- Vocabulary Fragmentation: Because Tenya uses a greedy prefix matcher instead of dynamic programming path optimization, it tends to split words into slightly smaller pieces. Tenya generates 1.5x to 1.9x more tokens than the teacher, meaning the text representations are longer.
- Not a Drop-in Replacement: You cannot plug Tenya directly into an already-trained LLaMA model because the word boundaries will mismatch, resulting in garbage outputs.
Where Tenya excels is when you are training a new model from scratch, building high-throughput dataset loading pipelines, or running local inference on highly resource-constrained IoT/edge hardware.
π Try it yourself!
The repository is open-source and includes the entire Python training pipeline, the Rust core runtime, the command-line CLI, and Criterion benchmark targets.
Get started by cloning the repo:
git clone https://github.com/spellsaif/tenya.git
cd tenya
Run the pipeline to train, distill, and export:
# Set up venv
python3 -m venv .venv
source .venv/bin/activate
pip install sentencepiece
# Train teacher & distill policy
python3 python/train_teacher.py --input data/sample.txt --vocab-size 500 --model-prefix teacher
python3 python/distill_policy.py --input data/sample.txt --model teacher.model --output policy.json
python3 python/export_tenya_vocab.py --model teacher.model --policy policy.json --output vocab.json
Tokenize text using the Rust CLI:
cargo run --release -p tenya-cli -- --vocab vocab.json encode --text \"hello world\"
Check out the full repository here:
π github.com/spellsaif/tenya
Let me know what you think of this architecture in the comments!
Top comments (0)