James Briggs

BERT WordPiece Tokenizer Explained

Building a transformer model from scratch is often the only option for more specialized use cases. Although BERT and other transformer models have been pre-trained for many languages and domains, they do not cover everything.

Often, it is these less common use cases that stand to gain the most from having someone build a dedicated transformer model, whether for an uncommon language or a less tech-savvy domain.

BERT is the most popular transformer for a wide range of language-based machine learning tasks, from sentiment analysis to question answering. BERT has enabled a diverse range of innovation across many borders and industries.

The first step for many in designing a new BERT model is the tokenizer. In this article, we’ll look at the WordPiece tokenizer used by BERT — and see how we can build our own from scratch.

Full walkthrough or free link if you don't have Medium!
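To give a feel for where we'll end up, here is a minimal sketch of training a BERT-style WordPiece tokenizer with Hugging Face's tokenizers library. The corpus file data.txt and the parameter values are placeholders for illustration; the full walkthrough covers each step in detail.

from tokenizers import BertWordPieceTokenizer

# Initialize a BERT-style WordPiece tokenizer (uncased, like bert-base-uncased)
tokenizer = BertWordPieceTokenizer(
    clean_text=True,            # remove control characters, normalize whitespace
    handle_chinese_chars=True,  # treat CJK characters as individual tokens
    strip_accents=False,        # keep accents (often useful for non-English text)
    lowercase=True,
)

# "data.txt" is a hypothetical plain-text corpus, one sample per line
tokenizer.train(
    files=["data.txt"],
    vocab_size=30_000,   # BERT-base uses a vocabulary of roughly 30K WordPieces
    min_frequency=2,     # discard pieces seen fewer than two times
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    wordpieces_prefix="##",  # prefix marking subwords that continue a word
)

# Write vocab.txt to the current directory
tokenizer.save_model(".")

# Quick check: words outside the vocabulary split into WordPieces
print(tokenizer.encode("tokenization").tokens)
# e.g. ['[CLS]', 'token', '##ization', '[SEP]']

The exact splits depend on the corpus the tokenizer was trained on, but the "##" prefix is the signature of WordPiece: it marks pieces that continue a word rather than start one.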
