James Briggs

Sentiment Analysis on ANY Length of Text With Transformers (Python)

The de facto standard for many natural language processing (NLP) tasks nowadays is to use a transformer. Text generation? Transformer. Question answering? Transformer. Language classification? Transformer!

However, one of the problems with many of these models (a problem that is not just restricted to transformer models) is that we cannot process long pieces of text.

Almost every article I write on Medium contains 1000+ words, which, when tokenized for a transformer model like BERT, will produce 1000+ tokens. BERT (and many other transformer models) will consume 512 tokens max, truncating anything beyond this length.

Although I think you may struggle to find value in processing my Medium articles, the same applies to many useful data sources, like news articles or Reddit posts.
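To make the 512-token limit concrete, here is a minimal sketch of the arithmetic. It assumes that two of the 512 positions are taken by the special [CLS] and [SEP] tokens, which leaves 510 positions for content. The function name `truncation_report` is hypothetical and only used for illustration.

```python
import math

MAX_LEN = 512        # maximum sequence length BERT will consume
SPECIAL_TOKENS = 2   # assumption: one [CLS] and one [SEP] per sequence


def truncation_report(n_tokens, max_len=MAX_LEN):
    """Summarize what the limit means for a text of n_tokens content tokens."""
    usable = max_len - SPECIAL_TOKENS  # positions left for real content
    kept = min(n_tokens, usable)
    return {
        "kept": kept,                       # tokens that survive truncation
        "dropped": n_tokens - kept,         # tokens silently discarded
        "chunks_needed": max(1, math.ceil(n_tokens / usable)),
    }


# A 1200-token article loses more than half of its content to truncation.
print(truncation_report(1200))
# {'kept': 510, 'dropped': 690, 'chunks_needed': 3}
```

In this example more than half of the article never reaches the model, which is why splitting the text into several chunks is worth the extra work.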

We will take a look at how we can work around this limitation. In this article, we will find the sentiment for long posts from the /r/investing subreddit. This article will cover:

High-Level Approach
Getting Started

  • Data
  • Initialization

Tokenization
Preparing The Chunks

  • Split
  • CLS and SEP
  • Padding
  • Reshaping For BERT

Making Predictions
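The chunk-preparation steps in the outline (split, CLS and SEP, padding, reshaping) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not necessarily the exact code used in the full walkthrough: the special-token ids 101, 102 and 0 are those of the `bert-base-uncased` vocabulary, the helper names `prepare_chunks` and `mean_probabilities` are hypothetical, and averaging the per-chunk probabilities is one plausible way to combine chunk-level predictions into a single sentiment score.

```python
import math

# Assumption: ids from the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0
CHUNK_LEN = 512


def prepare_chunks(token_ids, chunk_len=CHUNK_LEN):
    """Split token ids into fixed-length chunks that BERT can consume."""
    body = chunk_len - 2  # reserve two positions for [CLS] and [SEP]
    input_ids, attention_mask = [], []
    for start in range(0, len(token_ids), body):
        # Split, then wrap the piece in [CLS] ... [SEP].
        piece = [CLS_ID] + token_ids[start:start + body] + [SEP_ID]
        # Pad short chunks so every row has the same length.
        pad = chunk_len - len(piece)
        input_ids.append(piece + [PAD_ID] * pad)
        # The mask marks real tokens with 1 and padding with 0.
        attention_mask.append([1] * len(piece) + [0] * pad)
    return input_ids, attention_mask


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_probabilities(chunk_logits):
    """Average class probabilities across chunks (assumed aggregation)."""
    probs = [softmax(row) for row in chunk_logits]
    return [sum(col) / len(probs) for col in zip(*probs)]


token_ids = list(range(1000, 2200))  # 1200 stand-in token ids
ids, mask = prepare_chunks(token_ids)
print(len(ids), len(ids[0]), sum(mask[-1]))  # 3 512 182
```

Each row of `ids` and `mask` has the same length, so the two lists can be converted into tensors of shape (number of chunks, 512) and passed to the model as a single batch. The per-chunk logits the model returns can then be combined with `mean_probabilities`.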

Top comments (1)

amananandrai

A very nice and helpful tutorial. Thanks for making this.
