How I Extracted Amazon Review Text Data from Kaggle's .bz2 Dataset for Sentiment Analysis

#news #kaggle #rnn #programming

I recently started working on my first NLP and Sentiment Analysis project using the Amazon Reviews dataset on Kaggle.

I expected to find a normal CSV file that I could load using pandas.read_csv().

Instead, I found files like this:

train.ft.txt.bz2
test.ft.txt.bz2

At first, I was confused.

What is a .bz2 file?
How do I read it?
Why isn't there a CSV file?

After some searching and experimentation, I discovered that the dataset is stored as bzip2-compressed text files. Fortunately, Python's standard library includes a bz2 module that can read these files directly.

Step 1: Reading the Compressed File

The first step was simply reading the compressed file.

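import bz2
import pandas as pd

file_path = "/kaggle/input/datasets/bittlingmayer/amazonreviews/train.ft.txt.bz2"

# "rt" opens the archive in text mode, so each line is decoded to str on the fly
with bz2.open(file_path, "rt", encoding="utf-8") as f:
    lines = f.readlines()  # loads every review into memory at once

print(lines[0])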

The output looked something like this:
__label__2 Stunning even for the non-gamer: This soundtrack was beautiful...

At this point, I noticed that every line starts with a label followed by the review text.
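
The full training file is large, so readlines() can use a lot of memory. If the goal is only to inspect the format, a lighter option is to read just the first few lines. The following is a minimal sketch rather than what I ran originally: the helper name peek_lines is my own, and the sample file with its two reviews is made up so the snippet runs anywhere. On Kaggle, you would pass the real file_path instead.

```python
import bz2
import os
import tempfile
from itertools import islice


def peek_lines(path, n=5):
    """Return the first n lines of a bzip2-compressed text file."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        # islice stops after n lines, so the file is not read to the end
        return [line.rstrip("\n") for line in islice(f, n)]


# Hypothetical sample data in the same "__label__N text" layout as the dataset
sample_path = os.path.join(tempfile.mkdtemp(), "sample.ft.txt.bz2")
with bz2.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("__label__2 Great product: works as described\n")
    f.write("__label__1 Disappointing: broke after a week\n")

preview = peek_lines(sample_path, n=1)
print(preview[0])
```

Because the file object is iterated lazily, the same pattern also works for processing the whole file line by line without building a giant list first.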

Step 2: Separating Labels and Reviews

Next, I separated the sentiment label from the review text.

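labels = []
sentences = []

for line in lines:
    # Split once: the label token comes first, the rest of the line is the review
    label_token, sentence = line.split(" ", 1)

    labels.append(1 if label_token == "__label__2" else 0)
    sentences.append(sentence.strip())  # drop the trailing newline left by readlines()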

The dataset uses:

__label__1 → Negative Review
__label__2 → Positive Review

Since most machine learning models expect numerical targets, I converted them into:

0 → Negative
1 → Positive
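
One thing to watch: the loop above treats every label other than __label__2 as negative, so a malformed line would silently become a 0. A stricter variant can be sketched as a small helper. The name parse_line and the decision to raise an error are my own assumptions, not something the dataset requires, and the second sample review is made up.

```python
def parse_line(line):
    """Split one '__label__N review text' line into (sentiment, review)."""
    # partition splits at the first space only, so spaces inside the review survive
    label_token, _, review = line.strip().partition(" ")
    if label_token not in ("__label__1", "__label__2") or not review:
        raise ValueError(f"Unexpected line format: {line[:50]!r}")
    sentiment = 1 if label_token == "__label__2" else 0
    return sentiment, review


print(parse_line("__label__2 Stunning even for the non-gamer: This soundtrack was beautiful"))
print(parse_line("__label__1 Disappointing: broke after a week"))
```

Each call returns a (sentiment, review) tuple, which can be appended to the labels and sentences lists exactly as before.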

Step 3: Creating a DataFrame

Finally, I converted everything into a Pandas DataFrame.

df = pd.DataFrame({
    "review": sentences,
    "sentiment": labels,
})

print(df.head())

Example output:

                                              review  sentiment
0  Stunning even for the non-gamer: This soundtra...          1
1  The best soundtrack ever to anything...                    1
2  Amazing! This soundtrack is my favorite music...           1
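
Since I originally expected a CSV, one optional follow-up is to check the class balance and save the DataFrame once, so later notebooks can simply call pd.read_csv(). This is a minimal sketch: the two-row DataFrame is a made-up stand-in for the real df, and the file name and temporary folder are assumptions (on Kaggle, the writable folder is typically /kaggle/working).

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the real df (hypothetical rows, same two columns)
df = pd.DataFrame({
    "review": ["Great product: works as described", "Disappointing: broke after a week"],
    "sentiment": [1, 0],
})

# How many reviews fall in each class
print(df["sentiment"].value_counts())

# Save once, then reload with the usual read_csv call
csv_path = os.path.join(tempfile.mkdtemp(), "amazon_reviews.csv")
df.to_csv(csv_path, index=False)

reloaded = pd.read_csv(csv_path)
print(reloaded.shape)
```

Passing index=False keeps the row numbers out of the file, so the reloaded DataFrame has the same two columns as the original.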

Now the dataset is finally in a format that can be used for:

Text preprocessing
Feature extraction
Model training
Evaluation
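
As an illustration of the first item, a very basic preprocessing pass could lowercase each review and split it into word tokens. This is only a sketch of one possible starting point, not the pipeline I ended up using: the tokens column name and the regular expression are my own choices, and the one-row DataFrame stands in for the real df.

```python
import pandas as pd

# One-row stand-in for the real df
df = pd.DataFrame({
    "review": ["Stunning even for the non-gamer: This soundtrack was beautiful"],
    "sentiment": [1],
})

# Lowercase each review, then keep runs of letters, digits, and apostrophes as tokens
df["tokens"] = df["review"].str.lower().str.findall(r"[a-z0-9']+")

print(df.loc[0, "tokens"])
```

Note that this simple pattern splits hyphenated words such as "non-gamer" into two tokens and drops punctuation entirely, which may or may not suit a given model.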

What I Learned

Not every Kaggle dataset comes as a CSV file.
.bz2 is simply the extension for bzip2-compressed files.
Python's standard-library bz2 module can read these files directly.
This dataset uses fastText-style text labels (__label__1, __label__2) instead of numerical labels.
Converting the data into a DataFrame makes the next NLP steps much easier.

Final Thoughts

This was a small issue, but it completely blocked my progress for a while.
As a beginner in NLP, I am discovering that many challenges are not about machine learning algorithms but about understanding datasets and data formats.

Hopefully, this saves someone else a few hours of confusion.