As part of the 🤗 Tokenizers 0.9 release, it has never been easier to create extremely fast and versatile tokenizers for your next NLP task.
There is no better way to showcase the library's new capabilities than to build a BERT tokenizer from scratch.
Tokenizer
First, BERT relies on WordPiece, so we instantiate a new Tokenizer with this model:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
bert_tokenizer = Tokenizer(WordPiece())
Then we know that BERT preprocesses text by removing accents and lowercasing, so we also use a Unicode normalizer:
from tokenizers import normalizers
from tokenizers.normalizers import Lowercase, NFD, StripAccents
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
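To see the normalizer in action, you can apply it to a raw string directly (a quick sanity check; the sample sentence is just an illustration):
print(bert_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
# "hello how are u?"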
The pre-tokenizer is just splitting on whitespace and punctuation:
from tokenizers.pre_tokenizers import Whitespace
bert_tokenizer.pre_tokenizer = Whitespace()
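The pre-tokenizer can be inspected on its own in the same way; it returns the pre-tokenized words together with their offsets (the example sentence is illustrative):
print(bert_tokenizer.pre_tokenizer.pre_tokenize_str("Hello! How are you?"))
# [("Hello", (0, 5)), ("!", (5, 6)), ("How", (7, 10)), ("are", (11, 14)), ("you", (15, 18)), ("?", (18, 19))]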
And the post-processing uses the template we saw in the previous section:
from tokenizers.processors import TemplateProcessing
bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
We can use this tokenizer and train it on wikitext, as in the Quicktour:
from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(
    vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
bert_tokenizer.train(trainer, files)
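The IDs given to TemplateProcessing above (1 for [CLS], 2 for [SEP]) work because the trainer inserts its special_tokens at the start of the vocabulary, in list order: [UNK] gets ID 0, [CLS] gets 1, [SEP] gets 2, and so on. If you want to double-check this after training, token_to_id makes a quick sanity check (the snippet below is just an illustration):
# Verify that the IDs used in the post-processing template match the trained vocabulary
print(bert_tokenizer.token_to_id("[CLS]"))  # 1
print(bert_tokenizer.token_to_id("[SEP]"))  # 2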
model_files = bert_tokenizer.model.save("data", "bert-wiki")
bert_tokenizer.model = WordPiece.from_file(*model_files, unk_token="[UNK]")
bert_tokenizer.save("data/bert-wiki.json")
Now that the BERT tokenizer has been configured and trained, we can reload it with:
from tokenizers import Tokenizer
bert_tokenizer = Tokenizer.from_file("data/bert-wiki.json")
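With the tokenizer reloaded, encoding a pair of sentences shows the post-processing template at work: [CLS] and [SEP] are added automatically and the second sentence gets type ID 1. The sentences below are just an illustration, and the exact subword splits depend on the trained vocabulary:
output = bert_tokenizer.encode("This is one sentence.", "With this one we have a pair.")
print(output.tokens)
# ["[CLS]", "this", "is", "one", "sentence", ".", "[SEP]", "with", "this", "one", "we", "have", "a", "pair", ".", "[SEP]"]
print(output.type_ids)
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]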
Decoding
On top of encoding the input texts, a Tokenizer also has an API for decoding, that is, converting the IDs generated by your model back into text. This is done by the methods decode() (for one predicted text) and decode_batch() (for a batch of predictions).
The decoder will first convert the IDs back to tokens (using the tokenizer's vocabulary) and remove all special tokens, then join those tokens with spaces.
If you used a model that added special characters to represent subtokens of a given "word" (like the "##" in WordPiece), you will need to customize the decoder to treat them properly. If we take our previous bert_tokenizer, for instance, the default decoding will give:
output = bert_tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
print(output.tokens)
# ["[CLS]", "welcome", "to", "the", "[UNK]", "tok", "##eni", "##zer", "##s", "library", ".", "[SEP]"]
bert_tokenizer.decode(output.ids)
# "welcome to the tok ##eni ##zer ##s library ."
But by changing it to a proper decoder, we get:
from tokenizers import decoders
bert_tokenizer.decoder = decoders.WordPiece()
bert_tokenizer.decode(output.ids)
# "welcome to the tokenizers library."
Resources
Documentation: https://huggingface.co/docs/tokenizers/python/latest/pipeline.html#all-together-a-bert-tokenizer-from-scratch
Colab: https://colab.research.google.com/github/tenexcoder/huggingface-tutorials/blob/main/BERT_tokenizer_from_scratch.ipynb
Gist: https://gist.github.com/tenexcoder/85b38e17a5557f0bb7c44bda4a08271d
Credit
All credit goes to the Hugging Face Tokenizers documentation (see Resources for more details).
I simply packaged the example in a digestible and shareable form.