DEV Community

SUBHRANIL DAS

How I Extracted Amazon Review Text Data from Kaggle's .bz2 Dataset for Sentiment Analysis

I recently started working on my first NLP and Sentiment Analysis project using the Amazon Reviews dataset on Kaggle.

I expected to find a normal CSV file that I could load using pandas.read_csv().

Instead, I found files like this:

train.ft.txt.bz2
test.ft.txt.bz2

At first, I was confused.

  • What is a .bz2 file?
  • How do I read it?
  • Why isn't there a CSV file?

After some searching and experimentation, I discovered that the dataset is stored as bzip2-compressed text files. Fortunately, Python's standard library includes a bz2 module that can read these files directly, without decompressing them to disk first.

Step 1: Reading the compressed file

The first step was simply reading the compressed file.

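import bz2
import pandas as pd

# Path to the compressed training file in the Kaggle notebook environment
file_path = "/kaggle/input/datasets/bittlingmayer/amazonreviews/train.ft.txt.bz2"

# "rt" opens the archive in text mode, so each line is decoded to str while reading
with bz2.open(file_path, "rt", encoding="utf-8") as f:
    lines = f.readlines()  # loads every line into memory at once

print(lines[0])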

The output looked something like this:

__label__2 Stunning even for the non-gamer: This soundtrack was beautiful...

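At this point, I noticed that every line starts with a label followed by the review text. This is the fastText supervised-learning format (hence the .ft in the file name), which is why the data ships as plain text rather than as a CSV.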
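Calling readlines() pulls every line of the file into memory, which is more than needed just to inspect the format. As a minimal sketch (not part of the original notebook), the first few lines can be previewed with itertools.islice instead. The helper name read_first_lines and the tiny sample file with made-up reviews are assumptions for illustration, so the snippet runs without the Kaggle data.

```python
import bz2
import itertools
import os
import tempfile


def read_first_lines(path, n):
    """Return the first n decoded lines of a bz2-compressed text file."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        # islice stops iterating after n lines, so the file is not read to the end
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Hypothetical stand-in file with made-up reviews in the same line format
sample_path = os.path.join(tempfile.mkdtemp(), "sample.ft.txt.bz2")
with bz2.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("__label__2 Great product: Works exactly as described.\n")
    f.write("__label__1 Disappointing: Stopped working after a week.\n")
    f.write("__label__2 Love it: Would buy again.\n")

preview = read_first_lines(sample_path, 2)
print(preview)
```

On Kaggle, the same helper could be pointed at the real file path in place of sample_path.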

Step 2: Separating Labels and Reviews

Next, I separated the sentiment label from the review text.

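labels = []
sentences = []

for line in lines:
    # Split only on the first space: the label prefix, then the full review text
    prefix, sentence = line.split(" ", 1)

    # __label__2 is positive (1); anything else is treated as negative (0)
    labels.append(1 if prefix == "__label__2" else 0)
    # strip() removes the trailing newline that readlines() keeps on each line
    sentences.append(sentence.strip())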

The dataset uses:

  • __label__1 → Negative Review
  • __label__2 → Positive Review

Since most machine learning models expect numeric targets rather than strings, I converted them into:

  • 0 → Negative
  • 1 → Positive
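The mapping above can also be written as an explicit lookup table, which has the side benefit of failing loudly on a malformed line instead of silently labelling it negative. This is a sketch under assumptions: LABEL_MAP and parse_line are hypothetical names, not something the dataset or any library provides.

```python
# Hypothetical lookup table for the two labels used in this dataset
LABEL_MAP = {"__label__1": 0, "__label__2": 1}


def parse_line(line):
    """Split one raw line into (numeric_label, review_text)."""
    # partition splits on the first space only, leaving the review text intact
    prefix, _, text = line.partition(" ")
    if prefix not in LABEL_MAP:
        raise ValueError(f"unexpected label prefix: {prefix!r}")
    return LABEL_MAP[prefix], text.strip()


label, text = parse_line(
    "__label__2 Stunning even for the non-gamer: This soundtrack was beautiful...\n"
)
print(label, text)
```

Whether to raise or skip on an unknown prefix is a design choice; raising is used here so that format problems surface early.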

Step 3: Creating a DataFrame

Finally, I converted everything into a Pandas DataFrame.

df = pd.DataFrame({
    "review": sentences,
    "sentiment": labels,
})

print(df.head())

Example output:

                                              review  sentiment
0  Stunning even for the non-gamer: This soundtra...          1
1  The best soundtrack ever to anything...                    1
2  Amazing! This soundtrack is my favorite music...           1
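If a plain CSV is still preferred, the same parsing can be streamed straight to a file with the standard csv module, so the result loads with pandas.read_csv() later. The following is only a sketch, not the approach used in the steps above: bz2_to_csv is a hypothetical helper, and the sample file contains made-up reviews so the snippet runs without the Kaggle data.

```python
import bz2
import csv
import os
import tempfile


def bz2_to_csv(src_path, dst_path):
    """Stream a label-prefixed .bz2 text file into a two-column CSV.

    Returns the number of review rows written (header excluded).
    """
    rows = 0
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["review", "sentiment"])
        for line in src:
            # Label prefix first, then the review text after the first space
            prefix, _, text = line.partition(" ")
            writer.writerow([text.strip(), 1 if prefix == "__label__2" else 0])
            rows += 1
    return rows


# Hypothetical stand-in input with made-up reviews
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "sample.ft.txt.bz2")
dst_path = os.path.join(work_dir, "sample.csv")
with bz2.open(src_path, "wt", encoding="utf-8") as f:
    f.write("__label__2 Great product: Works, exactly as described.\n")
    f.write("__label__1 Disappointing: Stopped working after a week.\n")

n_rows = bz2_to_csv(src_path, dst_path)

# Read the CSV back to confirm the round trip
with open(dst_path, newline="", encoding="utf-8") as f:
    csv_rows = list(csv.reader(f))
print(n_rows, csv_rows[0])
```

Because lines are written one at a time, this variant never holds the whole dataset in memory, at the cost of a second file on disk.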

Now the dataset is finally in a format that can be used for:

  • Text preprocessing
  • Feature extraction
  • Model training
  • Evaluation

What I Learned:

  • Not every Kaggle dataset comes as a CSV file.
  • .bz2 is simply a compressed file format.
  • Python's built-in bz2 module can read these files directly.
  • This Amazon Reviews dataset uses text labels (__label__1, __label__2) instead of numerical labels.
  • Converting the data into a DataFrame makes the next NLP steps much easier.

Final Thoughts

This was a small issue, but it completely blocked my progress for a while.
As a beginner in NLP, I am discovering that many challenges are not about machine learning algorithms but about understanding datasets and data formats.

Hopefully, this saves someone else a few hours of confusion.
