DEV Community

James Briggs
Why are there so many tokenization methods in Transformers?

HuggingFace's transformers library is the de facto standard for NLP. Used by practitioners worldwide, it is powerful, flexible, and easy to use. It achieves this through a fairly large (and complex) codebase, which has prompted the question:

"Why are there so many tokenization methods in HuggingFace transformers?"

Tokenization is the process of encoding a string of text into the integer token IDs that a transformer model reads. In this video we cover five different methods for doing this - do they all produce the same output, or is there a difference between them?
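To make the idea concrete, here is a toy sketch of what any of these methods does under the hood: split text into tokens, look each token up in a vocabulary, and wrap the result in special tokens. The vocabulary, the whitespace splitting rule, and the function names below are invented for illustration - this is not the HuggingFace API, which uses subword algorithms like WordPiece or BPE rather than whitespace splitting.

```python
# Toy sketch of tokenization: text -> token ID integers.
# Vocabulary and IDs are invented for illustration (BERT-style special tokens).
vocab = {"[CLS]": 101, "[SEP]": 102, "[UNK]": 100, "hello": 7592, "world": 2088}

def tokenize(text):
    # Real tokenizers use subword algorithms; we just split on whitespace.
    return text.lower().split()

def convert_tokens_to_ids(tokens):
    # Unknown tokens map to the [UNK] placeholder ID.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def encode(text):
    # Wrap the sequence in [CLS] ... [SEP], as BERT-style encoders do.
    return [vocab["[CLS]"]] + convert_tokens_to_ids(tokenize(text)) + [vocab["[SEP]"]]

print(encode("Hello world"))  # -> [101, 7592, 2088, 102]
```

The different methods in the real library differ mainly in how much of this pipeline they run for you and what extra outputs (attention masks, padding, tensors) they return, not in the core mapping shown here.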

📙 Check out the Medium article - or, if you don't have a Medium membership, here's a free access link.

I also made an NLP with Transformers course - here's 70% off if you're interested!
