DEV Community: Giuseppe Schillaci

Building a Review Scraper with Python using BeautifulSoup and Sentiment Analysis with NLTK

Giuseppe Schillaci — Sat, 10 Feb 2024 22:13:23 +0000

Introduction

In recent years, analyzing online reviews has become crucial for many businesses. Understanding customer sentiment helps identify areas for improvement and gauge overall satisfaction. In this article, we'll use Python to build a review scraper with BeautifulSoup and analyze the sentiment of the reviews with NLTK.

Creating the Review Scraper with BeautifulSoup

To begin, we used Python with the BeautifulSoup library to extract reviews of a leading Italian company from its Trustpilot page. BeautifulSoup parses the HTML markup of a web page and lets us pull out the elements of interest, here the title and text of each review. We collected the reviews into a pandas DataFrame for further analysis.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# First and last page to scrape (inclusive)
page_start = 1
page_end = 49

# Collect one dict per review; DataFrame.append was removed in pandas 2.0
rows = []

# Loop through the pages
for page_num in range(page_start, page_end + 1):
    # Construct the URL for the current page
    url = f'https://it.trustpilot.com/review/www.companyname.it?page={page_num}'

    # Make an HTTP request to fetch the page content
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        print(f"Page {page_num} returned status {response.status_code}; skipping.")
        continue

    # Use BeautifulSoup to parse the HTML of the page
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all review elements
    reviews = soup.find_all(attrs={"data-review-content": True})

    # Extract the title and text of each review
    for review in reviews:
        title_element = review.find(attrs={"data-service-review-title-typography": True})
        content_element = review.find(attrs={"data-service-review-text-typography": True})

        if title_element and content_element:
            rows.append({"title": title_element.text, "text": content_element.text})
        else:
            print("Title or text element not found.")

# Build the DataFrame once, after all pages have been processed
df = pd.DataFrame(rows, columns=["title", "text"])

# Display the DataFrame with all review data (notebook-style output)
df
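
To check the extraction logic without hitting the live site, the same attribute lookups can be run against a small inline HTML string. The sketch below is only an illustration: the markup is hypothetical (it merely reuses the attribute names from the scraper), and `extract_reviews` is a helper name introduced here, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the attribute names used by the scraper
html = """
<div data-review-content="true">
  <h2 data-service-review-title-typography="true">Great service</h2>
  <p data-service-review-text-typography="true">Fast delivery and friendly support.</p>
</div>
<div data-review-content="true">
  <p data-service-review-text-typography="true">No title on this one.</p>
</div>
"""

def extract_reviews(markup):
    """Return a list of {'title', 'text'} dicts for reviews that have both parts."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    # attrs={...: True} matches any tag that carries the attribute
    for review in soup.find_all(attrs={"data-review-content": True}):
        title = review.find(attrs={"data-service-review-title-typography": True})
        text = review.find(attrs={"data-service-review-text-typography": True})
        if title and text:
            rows.append({
                "title": title.get_text(strip=True),
                "text": text.get_text(strip=True),
            })
    return rows

parsed = extract_reviews(html)
print(parsed)
```

The second review is skipped because it has no title element, which mirrors the "Title or text element not found." branch in the scraper.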

Review Analysis with NLTK

Once the reviews were extracted, we used the Natural Language Toolkit (NLTK), a widely used Python library for Natural Language Processing (NLP). NLTK provides a range of tools for text analysis, including sentiment analysis.

We used NLTK's SentimentIntensityAnalyzer, which is built on the VADER lexicon, to assess the sentiment of each review. For a given text it returns negative, neutral, and positive scores plus a compound score between -1 and 1. We label a review positive when the compound score is at least 0.05, negative when it is at most -0.05, and neutral otherwise. This gave us a clear picture of customer sentiment towards the company.


import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon for sentiment analysis
nltk.download('vader_lexicon')

# Create a SentimentIntensityAnalyzer object
sid = SentimentIntensityAnalyzer()

# Define a function to get the sentiment of a text
def get_sentiment(text):
    # Calculate the sentiment scores of the text
    scores = sid.polarity_scores(text)
    # Determine the sentiment based on the compound score
    if scores['compound'] >= 0.05:
        return 'positive'
    elif scores['compound'] <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Label every review; the charting step reads this 'sentiment' column
df['sentiment'] = df['text'].apply(get_sentiment)
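
To make the thresholds easier to reason about, here is a minimal sketch that applies the same rule to score dictionaries directly, so it runs without NLTK or the lexicon download. The dictionaries are hand-written in the shape `polarity_scores` returns (`neg`, `neu`, `pos`, `compound`); they are illustrative values, not real VADER output, and `label_from_scores` is a hypothetical helper name.

```python
def label_from_scores(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'negative', or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hand-written dicts in the shape polarity_scores returns; not real VADER output
examples = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.0, "neu": 0.9, "pos": 0.1, "compound": 0.05},
]
labels = [label_from_scores(s) for s in examples]
print(labels)
```

The last example sits exactly on the boundary: a compound score of 0.05 counts as positive because the comparison is inclusive.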

Visualizing the Results

Finally, we used the analyzed data to create bar and pie charts displaying the percentages of negative, positive, and neutral reviews. These charts offer a visual representation of the overall sentiment of the reviews and allow for easy identification of trends.

import matplotlib.pyplot as plt

# Count unique values in the 'sentiment' column
value_counts = df['sentiment'].value_counts()

# Define colors for each category
colors = {'positive': 'green', 'negative': 'red', 'neutral': 'blue'}

# Create a pie chart using the defined colors
plt.pie(value_counts, labels=value_counts.index, colors=[colors[value] for value in value_counts.index], autopct='%1.1f%%')

# Add title
plt.title('Sentiment Analysis of Reviews for Company XYZ')

# Show the chart
plt.show()
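
The percentages shown on the pie chart can also be computed directly, which is handy for reporting them in text or feeding a bar chart. The sketch below uses only the standard library. The label list is a hypothetical stand-in for `df['sentiment'].tolist()`, and `sentiment_percentages` is a helper name introduced for this example.

```python
from collections import Counter

def sentiment_percentages(labels):
    """Return the share of each label as a percentage, rounded to one decimal."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Hypothetical labels, standing in for df['sentiment'].tolist()
sample_labels = ["positive", "positive", "negative", "neutral", "positive"]
percentages = sentiment_percentages(sample_labels)
print(percentages)
```

The resulting dictionary could be passed to `plt.bar(list(percentages), list(percentages.values()))` to draw the bar-chart version of the same breakdown.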

Conclusion

In this article, we've seen how to use Python with the BeautifulSoup and NLTK libraries to build a review scraper and analyze the sentiment of online reviews. Together, these libraries let us measure customer sentiment and present the results in a clear visual form.

With similar techniques, businesses can monitor customer feedback continuously and make informed decisions to improve the customer experience. Web scraping paired with sentiment analysis is a practical approach to online reputation monitoring and customer relationship management.
