Building a transformer model from scratch is often the only option for more specialized use cases. Although BERT and other transformer models have been pre-trained for many languages and domains, they do not cover everything.
Often, it is these less common use cases, such as an uncommon language or a specialized domain, that stand to gain the most from a purpose-built transformer model.
BERT is the most popular transformer for a wide range of language-based machine learning tasks, from sentiment analysis to question answering. It has enabled a diverse range of innovation across many languages and industries.
The first step for many in designing a new BERT model is the tokenizer. In this article, we’ll look at the WordPiece tokenizer used by BERT — and see how we can build our own from scratch.
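Before building the tokenizer itself, it helps to see the core rule WordPiece applies once a vocabulary exists: each word is split greedily, longest-match-first, with non-initial pieces marked by a `##` prefix. The sketch below is a minimal pure-Python illustration of that matching step; the tiny vocabulary and the function name `wordpiece_tokenize` are illustrative, not part of any library.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word.

    A toy sketch of the inference-time rule: repeatedly take the longest
    prefix of the remaining characters that exists in the vocabulary,
    prefixing non-initial pieces with '##'.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the ## marker
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate until it is in the vocab
        if match is None:
            return [unk]  # no piece matched: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens


# A toy vocabulary, purely for demonstration
vocab = {"hug", "face", "##ging", "##s"}
print(wordpiece_tokenize("hugging", vocab))  # → ['hug', '##ging']
```

The real tokenizer we build in this article learns the vocabulary from a corpus; this snippet only shows how that vocabulary is applied to text afterwards.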