DEV Community: nanasi

Building a Tokenizer 9.5x Faster than SentencePiece Unigram in Pure Rust 🦀

nanasi — Fri, 29 May 2026 04:11:55 +0000

Tokenization is one of those silent bottlenecks in the Large Language Model (LLM) world. While GPUs do the heavy lifting of running the model, the CPU is responsible for splitting raw text into token IDs.

In particular, the Unigram tokenization algorithm (popularized by Google's SentencePiece and used in models like T5 and ALBERT) is notoriously CPU-intensive. During inference, it builds a lattice (a graph) of valid vocabulary matches and runs the Viterbi algorithm (a dynamic programming search) to find the path that maximizes joint probability.

While mathematically elegant, this graph search and its per-input working memory limit throughput.

What if we could move that complexity from runtime to distillation time?

Meet Tenya, an experimental Compiled Distilled Unigram (CDU) tokenizer written in Rust. In its fastest mode it tokenizes text at 23 million tokens per second, a 9.5x speedup over SentencePiece's core engine, with zero heap allocations in the hot path.

Here is how it works, the code behind it, and the performance trade-offs.

🛠️ The Bottleneck: Probabilistic Dynamic Programming

Under standard SentencePiece Unigram, tokenizing a string requires evaluating multiple matching subwords. For a word like tokenization, the dictionary might contain token, to, ken, iz, and ation.

To choose the best split, standard Unigram:

Builds a graph of possible cuts based on vocabulary overlaps.
Assigns log-probability scores to each edge.
Runs a Viterbi path-search to find the path with the highest joint likelihood.

Even with memory-optimized lattices, doing a graph-search dynamically for every single text string creates overhead.
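
For contrast, here is roughly what that per-input search looks like. This is a toy sketch, not SentencePiece's implementation: the function name `viterbi_segment`, the `HashMap` vocabulary, and any log-probabilities you feed it are assumptions made for illustration, and a real engine would use a trie instead of hashing every substring.

```rust
use std::collections::HashMap;

/// Toy Viterbi segmentation over a Unigram vocabulary.
/// `vocab` maps each piece to a (hypothetical) log-probability.
/// Returns the highest-scoring split, or None if the text cannot be covered.
pub fn viterbi_segment<'a>(text: &'a str, vocab: &HashMap<&str, f64>) -> Option<Vec<&'a str>> {
    let n = text.len();
    // best[i] = (score of the best path ending at byte i, start of its last piece)
    let mut best: Vec<Option<(f64, usize)>> = vec![None; n + 1];
    best[0] = Some((0.0, 0));

    for end in 1..=n {
        if !text.is_char_boundary(end) {
            continue;
        }
        for start in 0..end {
            if !text.is_char_boundary(start) {
                continue;
            }
            // An edge exists if `start` is reachable and text[start..end] is a piece.
            if let (Some((prev, _)), Some(&logp)) = (best[start], vocab.get(&text[start..end])) {
                let score = prev + logp;
                if best[end].map_or(true, |(s, _)| score > s) {
                    best[end] = Some((score, start));
                }
            }
        }
    }

    // Follow the back-pointers from the end of the text to recover the pieces.
    let mut pieces = Vec::new();
    let mut end = n;
    while end > 0 {
        let (_, start) = best[end]?;
        pieces.push(&text[start..end]);
        end = start;
    }
    pieces.reverse();
    Some(pieces)
}
```

The `best` table is the per-input working memory mentioned earlier: it has to be allocated and filled for every string, which is the cost Tenya tries to avoid.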

🧠 The Tenya Solution: Compiled Distilled Unigrams (CDU)

Tenya shifts this paradigm. It asks: What if we did the dynamic programming once during training/export, and used a simple, deterministic search during inference?

The architecture works in three phases:

Train: Train a SentencePiece Unigram model (the "Teacher") on a corpus.
Distill: Tokenize the corpus with the teacher to count how often each vocabulary piece is actually selected. We then calculate a static priority score for each token: $$\text{Priority} = \text{Teacher Log-Probability} + \alpha \cdot \ln(\text{Corpus Frequency} + 1)$$
Compile: Export this vocabulary to a JSON file, which Tenya loads and flattens into a contiguous, cache-friendly prefix Trie in RAM.
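
The priority formula from the Distill step is small enough to write out. A minimal sketch, assuming a hypothetical helper name `distilled_priority`; the post does not state which value of alpha Tenya uses, so it is left as a parameter here.

```rust
/// Static priority for one vocabulary piece:
/// teacher log-probability + alpha * ln(corpus frequency + 1).
/// `alpha` is a tunable weight; no particular value is implied by the source.
pub fn distilled_priority(teacher_logp: f64, corpus_freq: u64, alpha: f64) -> f64 {
    // ln_1p(x) computes ln(1 + x), so a piece the teacher never selected adds 0.
    teacher_logp + alpha * (corpus_freq as f64).ln_1p()
}
```

With the `+ 1` inside the logarithm, a piece that the teacher never picked keeps its raw log-probability, while frequently selected pieces are boosted in proportion to the log of their count.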

At runtime, Tenya does a single forward pass over the bytes of the text, matching character sequences greedily using this flat Trie and resolving conflicts using the pre-computed priorities.

🦀 Inside the Rust Core: Zero-Allocation Lookups

To make Tenya run as fast as possible, we implemented it in Rust with zero heap allocations in the hot path.

Instead of using pointer-based nodes allocated all over the heap, the prefix tree is compiled into a flat vector of contiguous memory nodes:

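pub struct TrieNode {
    pub token_id: Option<u32>,    // Some(id) if a vocabulary piece ends at this node
    pub priority: f64,            // distilled priority of that piece
    pub children: Vec<(u8, u32)>, // (byte, next_node_index), sorted by byte
}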

Because each node's children are sorted by byte value, moving to the next node is a binary search over a small contiguous slice, which is cache-friendly.
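
To make the lookup concrete, here is a sketch of how such a flat trie could be built and searched. The `FlatTrie` wrapper and its `insert` method are assumptions for illustration; only the `TrieNode` layout and the `common_prefix_search` callback shape (token ID, priority, match length) come from the code in this post.

```rust
pub struct TrieNode {
    pub token_id: Option<u32>,
    pub priority: f64,
    pub children: Vec<(u8, u32)>, // (byte, next_node_index), sorted by byte
}

/// Hypothetical container: all nodes live in one Vec, and nodes[0] is the root.
pub struct FlatTrie {
    pub nodes: Vec<TrieNode>,
}

impl FlatTrie {
    pub fn new() -> Self {
        FlatTrie { nodes: vec![TrieNode { token_id: None, priority: 0.0, children: Vec::new() }] }
    }

    /// Build-time insertion. Allocating here is fine because it is off the hot path.
    pub fn insert(&mut self, piece: &[u8], token_id: u32, priority: f64) {
        let mut node = 0usize;
        for &b in piece {
            let found = self.nodes[node].children.binary_search_by_key(&b, |&(k, _)| k);
            node = match found {
                Ok(i) => self.nodes[node].children[i].1 as usize,
                Err(i) => {
                    let next = self.nodes.len();
                    self.nodes.push(TrieNode { token_id: None, priority: 0.0, children: Vec::new() });
                    // Insert where binary search says the byte belongs, keeping children sorted.
                    self.nodes[node].children.insert(i, (b, next as u32));
                    next
                }
            };
        }
        self.nodes[node].token_id = Some(token_id);
        self.nodes[node].priority = priority;
    }

    /// Calls `on_match(token_id, priority, match_len)` for every piece that is a prefix of `bytes`.
    /// The walk itself allocates nothing.
    pub fn common_prefix_search(&self, bytes: &[u8], mut on_match: impl FnMut(u32, f64, usize)) {
        let mut node = &self.nodes[0];
        for (i, &b) in bytes.iter().enumerate() {
            match node.children.binary_search_by_key(&b, |&(k, _)| k) {
                Ok(pos) => node = &self.nodes[node.children[pos].1 as usize],
                Err(_) => return, // no piece continues with this byte
            }
            if let Some(id) = node.token_id {
                on_match(id, node.priority, i + 1);
            }
        }
    }
}
```

Searching `b"tokens"` in a trie holding `to` and `token` would report both pieces, with match lengths 2 and 5, and stop at the `s`.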

During tokenization, the main loop traverses this Trie byte-by-byte, resolving matches and pushing token IDs into a caller-supplied output vector (reserve its capacity up front so that `push` never has to reallocate):

pub fn encode_bytes_into(&self, bytes: &[u8], output: &mut Vec<u32>) {
    let mut offset = 0;
    let len = bytes.len();

    while offset < len {
        let mut best_match: Option<(u32, usize, f64)> = None;

        self.trie.common_prefix_search(&bytes[offset..], |token_id, priority, match_len| {
            // Pick the match with the highest distilled priority score
            let update = match best_match {
                None => true,
                Some((_, best_len, best_priority)) => {
                    priority > best_priority || (priority == best_priority && match_len > best_len)
                }
            };
            if update {
                best_match = Some((token_id, match_len, priority));
            }
        });

        if let Some((token_id, match_len, _)) = best_match {
            output.push(token_id);
            offset += match_len;
        } else {
            // Byte fallback for unknown bytes
            let byte = bytes[offset];
            output.push(self.fallback.lookup(byte));
            offset += 1;
        }
    }
}

🔄 Guaranteed 100% Reversibility

Many greedy tokenizers lose spaces or fail on emojis. Tenya guarantees 100% round-trip reversibility by reserving the final 256 IDs in its vocabulary for explicit byte fallbacks. If a character sequence is not in the Trie, it falls back to the exact bytes of the characters. When decoding, these bytes are reconstructed directly back into valid UTF-8.
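
A sketch of how the fallback IDs can round-trip. The `ByteFallback` struct, its `base` field, and the `decode` helper are assumptions for illustration: they model a layout where ordinary pieces occupy IDs below `base` and the 256 byte IDs sit at the end of the vocabulary, as described above.

```rust
/// Hypothetical ID layout: pieces use 0..base, byte fallbacks use base..base + 256.
pub struct ByteFallback {
    pub base: u32,
}

impl ByteFallback {
    /// ID emitted for a raw byte that no vocabulary piece covers.
    pub fn lookup(&self, byte: u8) -> u32 {
        self.base + byte as u32
    }

    /// Inverse of `lookup`: Some(byte) if `id` is in the reserved range.
    pub fn as_byte(&self, id: u32) -> Option<u8> {
        match id.checked_sub(self.base) {
            Some(off) if off < 256 => Some(off as u8),
            _ => None,
        }
    }
}

/// Rebuild the original bytes, then validate them as UTF-8 once at the end.
pub fn decode(
    ids: &[u32],
    pieces: &[&str],
    fallback: &ByteFallback,
) -> Result<String, std::string::FromUtf8Error> {
    let mut bytes = Vec::new();
    for &id in ids {
        match fallback.as_byte(id) {
            Some(b) => bytes.push(b),
            None => bytes.extend_from_slice(pieces[id as usize].as_bytes()),
        }
    }
    String::from_utf8(bytes)
}
```

Because an emoji is emitted as its individual UTF-8 bytes, the decoder has to collect all bytes before validating; decoding token by token would fail on the partial sequences.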

🏁 Empirical Benchmarks

We benchmarked Tenya against Google's SWIG-wrapped SentencePiece Unigram C++ library on a validation corpus consisting of mixed prose, numbers, and code:

Metric	SentencePiece Unigram (Teacher)	Tenya (LongestMatch)	Tenya (PriorityMatch)
p50 latency (ms)	0.415	0.047 (8.8x faster)	0.124 (3.3x faster)
p95 latency (ms)	0.890	0.056	0.149
Throughput (MB/s)	3.67	32.21	12.25
Throughput (M tokens/s)	1.67	23.30	11.28
Teacher F1 agreement (%)	100.00 (baseline)	38.57	43.80
Reversibility rate (%)	100.00	100.00	100.00

Why is there a speed difference between strategies?

LongestMatch is the fastest (23.30M tokens/s). It takes large steps through the text by matching long pieces, so it restarts its search from the root of the Trie less often.
PriorityMatch is slower (11.28M tokens/s) but tracks the teacher's segmentation somewhat more closely (43.80% vs. 38.57% F1 agreement). Because it prefers high-priority short sub-words (like common syllables), it has to restart its search from the root node more frequently.
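
The two strategies differ only in how they choose among the prefixes found at the current offset. A minimal sketch, assuming hypothetical `Candidate` and `Strategy` types and assuming LongestMatch simply keeps the longest prefix; the PriorityMatch rule mirrors the comparison in `encode_bytes_into` above.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Candidate {
    pub token_id: u32,
    pub len: usize,
    pub priority: f64,
}

#[derive(Clone, Copy)]
pub enum Strategy {
    LongestMatch,
    PriorityMatch,
}

/// Pick one candidate among all vocabulary pieces that prefix the remaining input.
pub fn pick(strategy: Strategy, candidates: &[Candidate]) -> Option<Candidate> {
    let mut best: Option<Candidate> = None;
    for &c in candidates {
        let better = match (strategy, best) {
            (_, None) => true,
            // Assumed rule: the longest prefix wins, priorities are ignored.
            (Strategy::LongestMatch, Some(b)) => c.len > b.len,
            // Highest priority wins; length only breaks exact ties.
            (Strategy::PriorityMatch, Some(b)) => {
                c.priority > b.priority || (c.priority == b.priority && c.len > b.len)
            }
        };
        if better {
            best = Some(c);
        }
    }
    best
}
```

If `to` carries a higher priority than `token`, PriorityMatch advances two bytes where LongestMatch advances five, which is where the extra root restarts come from.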

⚖️ The Trade-offs

Is this a free lunch? No.

Vocabulary Fragmentation: Because Tenya uses a greedy prefix matcher instead of dynamic programming path optimization, it splits words into noticeably smaller pieces. Tenya generates 1.5x to 1.9x more tokens than the teacher, so the encoded sequences are longer. This also inflates the tokens/sec figures above; throughput in MB/s (about 8.8x for LongestMatch) is the fairer comparison.
Not a Drop-in Replacement: You cannot plug Tenya directly into an already-trained LLaMA model, because the model was trained on a different segmentation and its token IDs will not line up, resulting in garbage outputs.

Where Tenya excels is when you are training a new model from scratch, building high-throughput dataset loading pipelines, or running local inference on highly resource-constrained IoT/edge hardware.

🚀 Try it yourself!

The repository is open-source and includes the entire Python training pipeline, the Rust core runtime, the CLI, and Criterion benchmark targets.

Get started by cloning the repo:

git clone https://github.com/spellsaif/tenya.git
cd tenya

Run the pipeline to train, distill, and export:

# Set up venv
python3 -m venv .venv
source .venv/bin/activate
pip install sentencepiece

# Train teacher & distill policy
python3 python/train_teacher.py --input data/sample.txt --vocab-size 500 --model-prefix teacher
python3 python/distill_policy.py --input data/sample.txt --model teacher.model --output policy.json
python3 python/export_tenya_vocab.py --model teacher.model --policy policy.json --output vocab.json

Tokenize text using the Rust CLI:

cargo run --release -p tenya-cli -- --vocab vocab.json encode --text "hello world"

Check out the full repository here:
👉 github.com/spellsaif/tenya

Let me know what you think of this architecture in the comments!

I stopped trusting middleware for everything (almost)

nanasi — Wed, 25 Mar 2026 06:06:06 +0000

Not because middleware is bad.

But because I was using it for things it was never meant to guarantee.

The pattern we all write

app.use(authMiddleware)

app.get("/me", (c) => {
  return c.json(c.get("user"))
})

This works. It’s simple. It’s familiar.

But it relies on an assumption:

“user will be there when I need it.”

And nothing actually enforces that.

Where things break

A few very normal mistakes:

You forget to apply middleware to a route
You register it in the wrong order
You refactor something and break the chain

Everything still compiles. The app still runs.

The failure shows up later — usually when it matters.

Middleware isn’t the problem

Frameworks like Hono and Elysia do middleware really well:

Hono keeps things minimal and close to Web standards
Elysia pushes type safety further than most frameworks

And middleware itself is great for:

logging
compression
request/response transformations

That’s exactly what it’s designed for.

The real issue

The problem is when we use middleware for something else:

data dependencies

When a handler depends on user, that dependency is implicit.

It’s not declared anywhere. It’s just assumed.

What I tried instead

Instead of replacing middleware, I separated concerns:

Middleware → handles flow
Relics → enforce what must exist

Defining a contract

const UserCtx = token<{ id: string }>("user")

const authRelic = relic(UserCtx, async (ctx) => {
  const user = await verify(ctx.req)
  if (!user) return err(Unauthorized)
  return user
})

This says:

“I provide UserCtx, or I fail.”

Using it in routes

app.scope("/user", authRelic, (r) => {
  r.get("/me", (ctx) => {
    return ctx.json(ctx.relic(UserCtx))
  })
})

Now the handler doesn’t assume anything.

If it runs, the dependency is already satisfied.

What changed

Before:

“this should exist”

Now:

“this must exist”

And if nothing in scope provides it, the app fails at startup rather than in the middle of a request.

Comparing approaches (in good faith)

Concern	Middleware (Hono / Elysia)	Relics (Tomoe)
Flow control	Excellent	Not the goal
Cross-cutting concerns	Excellent	Not the goal
Data dependencies	Implicit	Explicit
Guarantees	None	Enforced
Failure timing	Runtime	Startup

They’re not competing tools — they solve different problems.

Why this felt better

Most of my bugs weren’t about routing or performance.

They were about assumptions.

Middleware made those assumptions easy to write.

Relics made them explicit.

Final thought

I didn’t replace middleware.

I stopped asking it to do something it was never designed to do.

I’m building this idea into a small framework called Tomoe.

Still early, but I’d love feedback:

👉 https://github.com/Project-Tomoe/tomoe