The field of NLP has changed with the rise of LLMs, but NLP still has a role to play. Apply NLP techniques to scraped data and learn about tokenization, stemming, lemmatization, removing stop words, and more.
The applications of NLP are growing by the day, from voice assistants like Alexa and Siri to automatic text summarization and sentiment analysis. Even YouTube and Netflix use NLP to suggest what we should watch next. These applications were not easy to implement in the early days of NLP, which relied on traditional approaches. Recent advancements in the field have made them much easier to build and use.
But our goal for today is to learn some basic NLP techniques, as these definitely still have their uses in the era of LLMs. In this article, we'll use Twitter data scraped with the web scraping and automation platform Apify and apply NLP techniques to that data.
How to set up the environment
To start setting up NLP and Apify, you'll need to create a new directory and a Python file. You can do this by opening your terminal or command line and entering the following commands:
mkdir NLP
cd NLP
touch main.ipynb
Let's install the packages. Copy the command below, paste it into your terminal, and press Enter.
pip3 install apify-client nltk pandas scikit-learn spacy
This should install the dependencies in your system. To confirm that everything is installed properly, you can enter the following command in your terminal:
pip3 freeze | egrep '(apify-client|nltk|pandas|scikit-learn|spacy)'
This should list the installed dependencies along with their versions. If you spot any missing dependencies, re-run the installation command for that specific package.
Once we're done with the installation, we're ready to write our code.
📔 Note: We will be working in a notebook. Consider every code block a notebook cell.
Scrape data with a Twitter scraper
We'll scrape the data for these techniques using Tweet Flash - Twitter Scraper, a data extraction tool that scrapes tweets from any public profile. In this example scenario, we'll extract tweets from The New York Times Twitter account.
from apify_client import ApifyClient
# Initialize the ApifyClient with your API token
client = ApifyClient("Apify_API_Key")
# Prepare the actor input
run_input = {
    "max_tweets": 500,
    "language": "any",
    "user_info": "user info and replying info",
    "max_attempts": 5,
    # You can provide a list of profiles here
    "from_user": ['nytimes'],
    "only_tweets": True
}
# Run the Actor and wait for it to finish
run = client.actor("shanes/tweet-flash").call(run_input=run_input)
# Fetch and print Actor results from the run's dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item['text'])
This should print a list of 500 tweets from The New York Times Twitter profile. Then we'll convert this data into a pandas DataFrame.
import pandas as pd
# Initialize an empty list to store the tweets
tweets = []
# Iterate over the items in the dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Add the 'text' field of each item to the list
    tweets.append(item['text'])
# Convert the list to a DataFrame
df = pd.DataFrame(tweets, columns=['Tweet'])
# Print the DataFrame
print(df)
Converting this data into a DataFrame allows us to manipulate it more efficiently and conveniently.
Now our data is ready, and we can start applying different techniques to this DataFrame.
Related How to scrape Twitter data
Removing special characters
One of the first techniques we often use when working with raw data is to remove special characters like punctuation marks, symbols, etc. These characters don't carry any meaningful information and are typically irrelevant to the NLP models. By removing them, we can clean and normalize the text, making it easier to understand its overall meaning.
For this, we'll use a regular expression that removes the irrelevant characters.
# Import the regular expression
import re
# Apply the regex on each tweet using the lambda function
df['Cleaned Tweet'] = df['Tweet'].apply(lambda text: re.sub(r'[^a-zA-Z0-9\s]', '', text))
# Print the cleaned data
df
Tokenization
This is often the most used step in any NLP task. Tokenization breaks text into tokens, smaller pieces of text that make sense on their own. We can tokenize by sentences or by words, but word tokenization is the most common. It splits the text into tokens wherever it finds whitespace. For example, the sentence "Apify is your one-stop shop for web scraping" will look like this: ['Apify', 'is', 'your', 'one-stop', 'shop', 'for', 'web', 'scraping'].
Tokenization example
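Here's a minimal standalone sketch of the difference between sentence and word tokenization with NLTK (the example text is our own and not part of the scraped data):
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
# The punkt models are required by NLTK's tokenizers
nltk.download('punkt')
text = "Apify is your one-stop shop for web scraping. It also handles data extraction."
# Split the text into sentences
print(sent_tokenize(text))
# Split the text into word tokens (punctuation becomes its own token)
print(word_tokenize(text))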
Tokenization by word allows us to find the words that appear most often. For example, if we're analyzing a group of job ads or job-related tweets, we might find that "Python" is the most frequently used word. Let's look at an example.
import nltk
from nltk.probability import FreqDist
# Download the Punkt Tokenizer Models.
nltk.download('punkt')
# Apply the tokenization to every tweet
df['Tokenized Tweet'] = df['Cleaned Tweet'].apply(nltk.word_tokenize)
# Creating a single list containing all the words
all_words = [word for tokens in df['Tokenized Tweet'] for word in tokens]
# Calculate the frequency of each word
fdist = FreqDist(all_words)
# Get the most used word
most_common_word = fdist.max()
print('Most frequent word:', most_common_word)
The code above first tokenizes each tweet into tokens and then creates a single list of all the tokens. Finally, FreqDist takes the list all_words and finds the most repeated token in it.
On macOS, an SSL certificate error may arise when downloading NLTK resources due to the system's security settings. You can use the code below to work around the issue.
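This is a common workaround rather than an official fix: it falls back to an unverified HTTPS context just for the NLTK download. Alternatively, running the "Install Certificates" script that ships with the python.org installer usually resolves the error.
import ssl
import nltk
# Fall back to an unverified HTTPS context so nltk.download() can fetch resources
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context
nltk.download('punkt')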
Stemming and Lemmatization
Stemming and lemmatization are two of the most important and widely used normalization techniques in preprocessing. Both reduce words to a basic form, but they do it differently. Let's see how:
Stemming: stemming reduces a word to its basic form simply by chopping off the first or last few letters, based on common prefixes and suffixes found in words. For example, it will change the words "Jumps", "Jumping", and "Jumped" to "Jump", which is a valid word. But it sometimes fails to produce a valid word. For example, the stem of "Running", "Ran", and "Run" might be something like "Runn", which is not a valid word.
Lemmatization: lemmatization is a proper normalization technique that reduces a word to its basic or root form, and unlike stemming, the result is always a valid word. For example, it will change "Running" to "Run" and "faster" to "fast".
So a question arises: if lemmatization performs better, why do we need stemming at all? Lemmatization is slower than stemming, so if speed is your goal rather than accuracy, stemming is an appropriate approach. However, if accuracy is crucial, use lemmatization.
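To get a feel for the difference before touching the tweets, here's a minimal standalone sketch that runs a few example words (our own choices, not from the dataset) through NLTK's PorterStemmer and WordNetLemmatizer:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
# WordNet is required by the lemmatizer
nltk.download('wordnet')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ['running', 'jumped', 'studies', 'better']
for word in words:
    # Compare the stemmed form with the lemmatized (verb) form
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word, pos='v'))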
Now let's apply both stemming and lemmatization to our tweets:
Stemming example
We'll use the PorterStemmer class from the nltk.stem module. This is a stemming algorithm used to reduce words to their root form, and it's often used in NLP tasks to normalize text data.
from nltk.stem import PorterStemmer
# Create an instance of PorterStemmer
stemmer = PorterStemmer()
# Apply the stemmer to each word in each tokenized tweet
df['Stemmed Tweet'] = df['Tokenized Tweet'].apply(lambda x: [stemmer.stem(i) for i in x])
# Print the Stemmed Tweets
df['Stemmed Tweet']
We'll add a new column named Stemmed Tweet to the DataFrame with the words in their root form.
Lemmatization example
We'll use the WordNetLemmatizer class from the nltk.stem module. It performs lemmatization using the WordNet lexical database of English words. As we've already discussed, lemmatization is more effective than stemming because it uses vocabulary analysis to reduce words to their root form.
from nltk.stem import WordNetLemmatizer
# Download the wordnet
nltk.download('wordnet')
# Create an instance of WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Lemmatize each word in each tokenized tweet
df['Lemmatized Tweet'] = df['Tokenized Tweet'].apply(lambda x: [lemmatizer.lemmatize(i, pos='v') for i in x])
df['Lemmatized Tweet']
A new column named Lemmatized Tweet is added to the DataFrame with the words in their root form.
📔 Note: We usually use only one of these techniques in an NLP task, not both.
Removing stop words
This technique removes stop words like is, am, are, and they from the text, which are considered unnecessary in most NLP tasks. They carry little meaning on their own; they mainly connect sentences or show the relationship of one word to another.
Let's remove these words and keep only the tokens with meaning attached. We'll use the stopwords corpus from the nltk.corpus module, which is a list of common words that are often considered noise in text data.
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Remove stop words from each tokenized tweet
df['Tweet Without Stopwords'] = df['Lemmatized Tweet'].apply(lambda x: [word for word in x if word.lower() not in stop_words])
print(df['Tweet Without Stopwords'])
We'll remove the stop words from the lemmatized column and add a new column named Tweet Without Stopwords.
Bag of Words (BoW)
The BoW model is used to convert text into fixed-length vectors, which we can later feed to our machine learning models. The BoW model doesn't care about the order of the words, only their frequency, and it creates an n-dimensional vector for each document (where n is the size of the vocabulary).
For example, the three sentences:
Sentence 1: "I love Apify"
Sentence 2: "I love to scrape data"
Sentence 3: "I love to code"
After removing the stop words, the vectors would look like this:
| | love | Apify | scrape | data | code |
|---|---|---|---|---|---|
| Sentence 1 | 1 | 1 | 0 | 0 | 0 |
| Sentence 2 | 1 | 0 | 1 | 1 | 0 |
| Sentence 3 | 1 | 0 | 0 | 0 | 1 |
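Here's a minimal sketch that reproduces the toy table above with scikit-learn's CountVectorizer (the columns come out in alphabetical order rather than the order shown in the table):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
sentences = [
    "I love Apify",
    "I love to scrape data",
    "I love to code",
]
# Build the vocabulary and count word occurrences, dropping English stop words
vectorizer = CountVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(sentences)
# One row per sentence, one column per word in the vocabulary
print(pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out()))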
Now let's do the same with our tweets. We'll take the Cleaned Tweet column and convert it into a numerical format using CountVectorizer from the scikit-learn library. CountVectorizer turns the text into a matrix of token counts, representing words as numbers. After processing, each row in the resulting DataFrame corresponds to a tweet, and each column represents a unique word, with values indicating how many times that word occurs in each tweet.
import pandas as pd
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Create an instance of CountVectorizer with English stop words
vectorizer = CountVectorizer(stop_words='english')
# Pass the tweets to CountVectorizer
vectorizer_matrix = vectorizer.fit_transform(df['Cleaned Tweet'])
# Create a matrix with words as columns and tweets as rows
final_df = pd.DataFrame(vectorizer_matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
# Print the matrix
final_df
Named Entity Recognition (NER)
As the name suggests, NER is an information extraction technique used to find named entities in text and classify them into categories like persons, organizations, places, quantities, and so forth. Use cases of NER include text classification, customer support, recommendation systems, and more.
| Sentence | Entity 1 | Entity Type 1 | Entity 2 | Entity Type 2 |
|---|---|---|---|---|
| Apify is your one-stop shop for web scraping, data extraction, and RPA located in Czechia. | Apify | ORGANIZATION | Czechia | LOCATION |
| Barack Obama was born in Hawaii. | Barack Obama | PERSON | Hawaii | LOCATION |
We'll use the spacy library to extract named entities from the Cleaned Tweet column and save them in a new column called entities.
We need to download a model that will perform named entity recognition for us. Run the following command in a notebook cell to download it.
!python3 -m spacy download en_core_web_sm
Once the downloading is completed, we can execute the code below.
import spacy
# Load Spacy model
model = spacy.load("en_core_web_sm")
# Function to get entities
def get_entities(text):
    doc = model(text)
    return [(X.text, X.label_) for X in doc.ents]
# Create a new column with entities
df['entities'] = df['Cleaned Tweet'].apply(get_entities)
# Print the DataFrame
df[['Cleaned Tweet', 'entities']]
In the code above, we're using spaCy's pre-trained English model called en_core_web_sm to find named entities in the Cleaned Tweet column. We've defined a function that identifies entities (like persons, organizations, locations, etc.) in each tweet. The final output is a DataFrame displaying the cleaned tweets alongside their named entities.
Sentiment Analysis
Sentiment analysis is one of the most widely used NLP techniques for extracting the emotions or sentiments expressed in a text, typically comments, reviews, or tweets. The labels are usually positive, negative, and neutral.
We're performing sentiment analysis on the Cleaned Tweet column using NLTK's SentimentIntensityAnalyzer and the VADER lexicon, which is specifically designed for sentiment analysis of social media text.
from nltk.sentiment import SentimentIntensityAnalyzer
# Download the vader_lexicon
nltk.download('vader_lexicon')
# Create an instance of Sentiment Intensity Analyzer
sia = SentimentIntensityAnalyzer()
# Define a function to calculate the sentiment score
def get_sentiment_score(tweet):
    return sia.polarity_scores(tweet)['compound']
# Get the sentiment score of each tweet and save it in a new column
df['Sentiment Score'] = df['Cleaned Tweet'].apply(get_sentiment_score)
# This function assigns a label to each tweet depending on the score
def assign_label(score):
    if score > 0.05:
        return 'positive'
    elif score < -0.05:
        return 'negative'
    else:
        return 'neutral'
# Pass the score of each tweet to the function above
df['Sentiment'] = df['Sentiment Score'].apply(assign_label)
# Display the DataFrame
print(df[['Cleaned Tweet', 'Sentiment', 'Sentiment Score']])
First, we define a function called get_sentiment_score, which calculates the sentiment score of each tweet using the analyzer's polarity_scores method. We apply this function to the Cleaned Tweet column and store the sentiment scores in a new Sentiment Score column.
Next, we define a function assign_label that converts the numerical sentiment scores into categorical labels ('positive', 'negative', 'neutral').
Related What is Hugging Face and why use it for NLP and LLMs?
NLP vs. LLM: context, tone, syntax, and semantics
Problems with traditional NLP
Traditional NLP models have many use cases, but they also have limitations. Let's discuss them one by one.
Understanding context: traditional approaches often fail to capture the full context of a sentence. Take the sentence "He is feeling blue today". Humans know this sentence refers to the person's mood, but traditional NLP systems typically won't get this.
Sarcasm and humor: another big issue is detecting sarcasm and humor. Sarcasm can be an art form, and traditional NLP systems often fail to identify sarcastic or humorous remarks. We humans use tone, context, and prior knowledge to infer these elements, and it's hard to program a system to do the same.
Syntax vs. semantics: NLP techniques are usually very good at understanding sentence structure, but they struggle with the meaning of sentences. For example, "The man bites the dog" and "The dog bites the man" have the same words and structure but entirely different meanings. Traditional NLP techniques don't always pick up on these differences, as the short sketch below shows.
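As a quick illustration, a bag-of-words representation (the technique we used earlier) produces identical vectors for these two sentences, because it only counts words and ignores their order:
from sklearn.feature_extraction.text import CountVectorizer
sentences = ["The man bites the dog", "The dog bites the man"]
# Count word occurrences for both sentences
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(sentences).toarray()
print(vectorizer.get_feature_names_out())
print(matrix[0])  # vector for "The man bites the dog"
print(matrix[1])  # vector for "The dog bites the man"
# Both rows are identical, so word counts alone can't tell the sentences apart
print((matrix[0] == matrix[1]).all())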
Related What is Haystack? An introduction to the NLP framework
The rise of Large Language Models (LLMs)
A remarkable shift has occurred in the field of Natural Language Processing (NLP) with the emergence of large language models. Traditional algorithms that once dominated the field are now being overshadowed by the power and capabilities of these models.
Large Language Models (LLMs) like GPT-4 have drastically changed the field of Natural Language Processing. They're called large language models because they're trained on huge datasets, which helps them overcome many of the limitations of traditional NLP.
Understanding context: LLMs are far better at understanding the context of a conversation or a sentence than traditional NLP systems. They are based on transformers, which allow them to keep track of the entire sequence of a conversation and interpret ambiguous sentences more accurately.
Sarcasm and humor: although detecting sarcasm and humor remains tricky, LLMs are trained on huge amounts of data covering a wide variety of linguistic styles, tones, and contexts. Moreover, their transformer-based architecture gives them a better grasp of whole sentences.
Syntax vs. semantics: LLMs are trained in such a way that they recognize patterns that allow them to understand both syntax and semantics. This enables them to differentiate between phrases with the same structure but different meanings.
Apify and LLMs
Apify has responded to the rise of LLMs by providing support for these models with its existing Actors (serverless cloud programs). This allows businesses to easily extract data from the web and use it for their own LLMs.
Apify also provides support for LangChain, a framework for building LLM-powered applications that can be grounded in your own data. This matters because ChatGPT, one of the most popular LLM-based tools, was trained on data collected before 2021, so it may not be as accurate or up to date as applications that draw on more recent data.
By providing support for LLMs, Apify is helping businesses take advantage of the latest NLP technologies. This will allow companies and organizations to improve their products and services and to better understand their customers.
If you're interested in LangChain and want to train an LLM with your own data, here are a few helpful guides for you:
Frequently asked questions
What is Natural Language Processing (NLP)?
NLP is a field of Artificial Intelligence (AI) that aims to make computers understand, interpret, and generate human language.
What are Large Language Models (LLMs)?
LLMs are machine learning models trained on a large corpus of text. They can understand, interpret, and generate human language.
How do LLMs work?
LLMs are pre-trained on a large corpus of text data and then fine-tuned on a specific task. This allows them to leverage the knowledge learned during pre-training to perform well on a range of tasks.
What are some applications of LLMs?
LLMs have a wide range of applications, including sentiment analysis, text generation, question answering, text summarization, and machine translation.
What are the challenges of using LLMs?
The challenges with LLMs include the high cost of training, ethical concerns related to bias and misinformation, and issues with explainability and interpretability.