As part of the 🤗 Tokenizers 0.9 release, it has never been easier to create extremely fast and versatile tokenizers for your next NLP task.
There is no better way to showcase the library's new capabilities than to build a BERT tokenizer from scratch.
Tokenizer
First, BERT relies on WordPiece, so we instantiate a new Tokenizer with this model:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
bert_tokenizer = Tokenizer(WordPiece())
Then we know that BERT preprocesses text by removing accents and lowercasing, so we also use a Unicode normalizer:
from tokenizers import normalizers
from tokenizers.normalizers import Lowercase, NFD, StripAccents
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
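To see the normalizer in action, you can apply it to a raw string directly (a quick sanity check; the sample sentence is just an illustration):
print(bert_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
# "hello how are u?"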
The pre-tokenizer is just splitting on whitespace and punctuation:
from tokenizers.pre_tokenizers import Whitespace
bert_tokenizer.pre_tokenizer = Whitespace()
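The pre-tokenizer can be inspected on its own in the same way; it returns the pre-tokenized words together with their offsets (the example sentence is illustrative):
print(bert_tokenizer.pre_tokenizer.pre_tokenize_str("Hello! How are you?"))
# [("Hello", (0, 5)), ("!", (5, 6)), ("How", (7, 10)), ("are", (11, 14)), ("you", (15, 18)), ("?", (18, 19))]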
And the post-processing uses the template we saw in the previous section:
from tokenizers.processors import TemplateProcessing
bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
We can use this tokenizer and train it on wikitext, as in the Quicktour:
from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(
    vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
bert_tokenizer.train(trainer, files)
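The IDs given to TemplateProcessing above (1 for [CLS], 2 for [SEP]) work because the trainer inserts its special_tokens at the start of the vocabulary, in list order: [UNK] gets ID 0, [CLS] gets 1, [SEP] gets 2, and so on. If you want to double-check this after training, token_to_id makes a quick sanity check (the snippet below is just an illustration):
# Verify that the IDs used in the post-processing template match the trained vocabulary
print(bert_tokenizer.token_to_id("[CLS]"))  # 1
print(bert_tokenizer.token_to_id("[SEP]"))  # 2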
model_files = bert_tokenizer.model.save("data", "bert-wiki")
bert_tokenizer.model = WordPiece.from_file(*model_files, unk_token="[UNK]")
bert_tokenizer.save("data/bert-wiki.json")
Now that the BERT tokenizer has been configured and trained, we can reload it with:
from tokenizers import Tokenizer
bert_tokenizer = Tokenizer.from_file("data/bert-wiki.json")
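With the tokenizer reloaded, encoding a pair of sentences shows the post-processing template at work: [CLS] and [SEP] are added automatically and the second sentence gets type ID 1. The sentences below are just an illustration, and the exact subword splits depend on the trained vocabulary:
output = bert_tokenizer.encode("This is one sentence.", "With this one we have a pair.")
print(output.tokens)
# ["[CLS]", "this", "is", "one", "sentence", ".", "[SEP]", "with", "this", "one", "we", "have", "a", "pair", ".", "[SEP]"]
print(output.type_ids)
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]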
Decoding
On top of encoding the input texts, a Tokenizer also has an API for decoding, that is, converting the IDs generated by your model back into text. This is done by the methods decode() (for one predicted text) and decode_batch() (for a batch of predictions).
The decoder will first convert the IDs back to tokens (using the tokenizer's vocabulary) and remove all special tokens, then join those tokens with spaces.
If you used a model that added special characters to represent subtokens of a given "word" (like the "##" in WordPiece), you will need to customize the decoder to treat them properly. If we take our previous bert_tokenizer, for instance, the default decoding will give:
output = bert_tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
print(output.tokens)
# ["[CLS]", "welcome", "to", "the", "[UNK]", "tok", "##eni", "##zer", "##s", "library", ".", "[SEP]"]
bert_tokenizer.decode(output.ids)
# "welcome to the tok ##eni ##zer ##s library ."
But by changing it to a proper decoder, we get:
from tokenizers import decoders
bert_tokenizer.decoder = decoders.WordPiece()
bert_tokenizer.decode(output.ids)
# "welcome to the tokenizers library."
Resources
Documentation: https://huggingface.co/docs/tokenizers/python/latest/pipeline.html#all-together-a-bert-tokenizer-from-scratch
Colab: https://colab.research.google.com/github/tenexcoder/huggingface-tutorials/blob/main/BERT_tokenizer_from_scratch.ipynb
Gist: https://gist.github.com/tenexcoder/85b38e17a5557f0bb7c44bda4a08271d
Credit
All credit goes to the Hugging Face Tokenizers documentation (see Resources for more details).
I simply packaged the example in a digestible and shareable form.