I recently started working on my first NLP and Sentiment Analysis project using the Amazon Reviews dataset on Kaggle.
I expected to find a normal CSV file that I could load using pandas.read_csv().
Instead, I found files like this:
train.ft.txt.bz2
test.ft.txt.bz2
At first, I was confused.
- What is a .bz2 file?
- How do I read it?
- Why isn't there a CSV file?
After some searching and experimentation, I discovered that the dataset is stored as bzip2-compressed text files. Fortunately, Python's standard library includes a bz2 module that can read these files directly.
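To convince myself that a .bz2 file is just ordinary text that has been compressed, a small round trip helps. The sketch below is only an illustration: the helper name bz2_round_trip and the file name sample.txt.bz2 are made up for this example and are not part of the dataset.

```python
import bz2
import os
import tempfile


def bz2_round_trip(text):
    """Write text to a temporary .bz2 file, then read it back as text."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.txt.bz2")  # hypothetical file name
        # "wt" = write text: the data is compressed as it is written
        with bz2.open(path, "wt", encoding="utf-8") as f:
            f.write(text)
        # "rt" = read text: the data is decompressed as it is read
        with bz2.open(path, "rt", encoding="utf-8") as f:
            return f.read()


sample = "__label__2 Great product: works as described\n"
restored = bz2_round_trip(sample)
print(restored == sample)
```

If the comparison holds, the compressed file carried exactly the same text, which is why no separate unzip step should be needed before reading it.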
Step 1: Reading the Compressed File
The first step was to open the compressed file in text mode and read its lines into memory.
import bz2
import pandas as pd

file_path = "/kaggle/input/datasets/bittlingmayer/amazonreviews/train.ft.txt.bz2"

# "rt" opens the compressed file in text mode, so each line is decoded to str
with bz2.open(file_path, "rt", encoding="utf-8") as f:
    lines = f.readlines()

print(lines[0])
The output looked something like this:
__label__2 Stunning even for the non-gamer: This soundtrack was beautiful...
At this point, I noticed that every line starts with a label followed by the review text.
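One caveat with readlines() in Step 1 is that it holds every line in memory at once. If that becomes a problem, a possible alternative is to read only the first few lines while exploring. This is a minimal sketch, not the approach used in the rest of this post; read_first_lines is a hypothetical helper name.

```python
import bz2
from itertools import islice


def read_first_lines(path, n):
    """Return at most the first n lines of a bzip2-compressed text file."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        # Iterating over the file object reads line by line,
        # so islice stops after n lines instead of loading the whole file
        return list(islice(f, n))
```

For example, read_first_lines(file_path, 1000) would return only the first 1,000 lines, which may be enough to inspect the format before committing to a full load.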
Step 2: Separating Labels and Reviews
Next, I separated the sentiment label from the review text.
labels = []
sentences = []

for line in lines:
    # Split once: the first token is the label, the rest is the review text
    tag, sentence = line.split(" ", 1)
    labels.append(1 if tag == "__label__2" else 0)
    sentences.append(sentence.strip())  # drop the trailing newline
The dataset uses:
- __label__1 → Negative Review
- __label__2 → Positive Review
Since machine learning models work better with numerical values, I converted them into:
- 0 → Negative
- 1 → Positive
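The label-to-number mapping above can also be written as a small function, which makes it easy to test on a single line. This is a sketch under one assumption of mine: it raises an error for any label other than the two listed, rather than treating it as negative. The names LABEL_MAP and parse_line are hypothetical.

```python
# Mapping taken from the two labels described above
LABEL_MAP = {"__label__1": 0, "__label__2": 1}  # negative -> 0, positive -> 1


def parse_line(line):
    """Split one raw line into (numeric_label, review_text)."""
    # partition splits at the first space only, so the review text stays intact
    tag, _, text = line.partition(" ")
    if tag not in LABEL_MAP:
        # Assumption: an unknown label is treated as an error, not as negative
        raise ValueError(f"Unexpected label: {tag!r}")
    return LABEL_MAP[tag], text.strip()


print(parse_line("__label__2 Stunning even for the non-gamer: This soundtrack was beautiful...\n"))
```

Failing loudly on an unknown label is a design choice: it can surface a malformed line early instead of quietly counting it as a negative review.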
Step 3: Creating a DataFrame
Finally, I converted everything into a Pandas DataFrame.
df = pd.DataFrame({
    "review": sentences,
    "sentiment": labels,
})

print(df.head())
Example output:
                                              review  sentiment
0  Stunning even for the non-gamer: This soundtra...          1
1            The best soundtrack ever to anything...          1
2   Amazing! This soundtrack is my favorite music...          1
Now the dataset is finally in a format that can be used for:
- Text preprocessing
- Feature extraction
- Model training
- Evaluation
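Since I originally expected a CSV, one optional extra is to write the parsed data out as a CSV once, so later sessions can load it directly. The sketch below uses only the standard library and streams line by line; convert_to_csv and both path parameters are hypothetical names, and the column names simply mirror the DataFrame above.

```python
import bz2
import csv


def convert_to_csv(src_path, dst_path):
    """Stream a labeled .bz2 text file into a CSV; return the number of rows written."""
    rows = 0
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)  # quotes reviews that contain commas
        writer.writerow(["review", "sentiment"])
        for line in src:
            # First token is the label, the rest is the review text
            tag, _, text = line.partition(" ")
            writer.writerow([text.strip(), 1 if tag == "__label__2" else 0])
            rows += 1
    return rows
```

The resulting file should then be readable with pandas.read_csv(), the loading step I had expected to use in the first place.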
What I Learned
- Not every Kaggle dataset comes as a CSV file.
- .bz2 is simply a compressed file format.
- Python's built-in bz2 library can read these files directly.
- This Amazon Reviews dataset stores its labels as text prefixes (__label__1, __label__2) rather than numerical values.
- Converting the data into a DataFrame makes the next NLP steps much easier.
Final Thoughts
This was a small issue, but it completely blocked my progress for a while.
As a beginner in NLP, I am discovering that many challenges are not about machine learning algorithms but about understanding datasets and data formats.
Hopefully, this saves someone else a few hours of confusion.