How to Get Filtered Amazon Reviews into a Pandas DataFrame in Under 50 Lines of Python


We've all been there.

You need data from Amazon. You write a simple requests script. It works. Then... 403 Forbidden. CAPTCHA. IP Ban. You add proxies, User-Agent rotation. Next week, Amazon changes a CSS class, and your script breaks again.

The truth is: maintaining scrapers is worse than writing them.

In this tutorial, we're skipping all that pain. We'll take an API-first approach and let a specialized API (a pre-built Apify Actor) handle the scraping hell, while we focus on the fun part: analyzing the data with Python and Pandas.

Step 1: Set Up Your Environment
First, let's install the libraries. We'll use apify-client to call the API and pandas for data handling.

pip install apify-client pandas

You'll also need an Apify account (the free tier is fine) to get your API token. You can find it in your account settings under "Integrations."
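Once you've copied the token, export it as an environment variable rather than hard-coding it in your script. Here's a minimal sanity check (assuming you name the variable APIFY_TOKEN, which is what the script below expects):

# Run this once to confirm the token is visible to Python.
# In your shell first: export APIFY_TOKEN="apify_api_..."
import os

token = os.environ.get("APIFY_TOKEN")
print("Token found." if token else "APIFY_TOKEN is not set.")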

Step 2: Define Your "Advanced Filter" Task
We don't want all reviews. That's noise. We want the "verified purchase" 1- and 2-star reviews to find a competitor's fatal flaw.

To run this specific scrape, we'll call the Amazon Reviews Scraper with Advanced Filters. Its main advantage is that it accepts a detailed JSON input and applies the filtering on the server side.

Here's the JSON "payload" we'll send:

{
  "productAsins": ["B09JVCL7JR"], 
  "filterByStarRating": [1, 2],
  "filterByVerifiedPurchase": true,
  "minReviewLength": 50,
  "maxReviews": 100 
}

(I'm using a popular earbud ASIN as an example. maxReviews: 100 is good practice to keep our test run fast.)
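If you plan to run this for several competitor products, it can help to build the payload in Python rather than hand-editing JSON. Here's a small sketch; the helper name build_review_filter is my own invention, but the field names are exactly the ones from the payload above:

# Hypothetical helper that generates the Actor input for any list of ASINs.
def build_review_filter(asins, stars=(1, 2), min_length=50, max_reviews=100):
    return {
        "productAsins": list(asins),
        "filterByStarRating": list(stars),
        "filterByVerifiedPurchase": True,
        "minReviewLength": min_length,
        "maxReviews": max_reviews,
    }

# Same payload as the JSON above:
actor_input = build_review_filter(["B09JVCL7JR"])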

Step 3: Run the Actor & Fetch Data (The Python Script)
Now for the core code. We'll initialize the client, call the Actor, wait for it to finish, and pull the results into a list.

This code does all the heavy lifting:

import os
import pandas as pd
from apify_client import ApifyClient

# Get your token from an environment variable (recommended)
# Or just paste it: apify_client = ApifyClient("YOUR_TOKEN")
APIFY_TOKEN = os.environ.get("APIFY_TOKEN")

if not APIFY_TOKEN:
    raise Exception("Please set the 'APIFY_TOKEN' environment variable")

# 1. Initialize the client
apify_client = ApifyClient(APIFY_TOKEN)

print("Running the Actor...")

# 2. Define our input payload
actor_input = {
  "productAsins": ["B08N5HRT9B"], # Example ASIN
  "filterByStarRating": [1, 2],
  "filterByVerifiedPurchase": true,
  "minReviewLength": 50,
  "maxReviews": 100 
}

# 3. Call the Actor and wait for it to finish (.call() blocks until the run completes or wait_secs elapses)
run = apify_client.actor("delicious_zebu/amazon-reviews-scraper-with-advanced-filters").call(
    run_input=actor_input,
    wait_secs=120 # Wait a max of 2 minutes
)

print("Run finished. Fetching results...")

# 4. Get the results from the Actor's dataset
items = []
for item in apify_client.dataset(run["defaultDatasetId"]).iterate_items():
    items.append(item)

# 5. Load into a Pandas DataFrame
df = pd.DataFrame(items)

print(f"Successfully fetched {len(df)} reviews.")
print(df.head())
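One practical tip before we analyze: persist the raw results locally so you don't have to re-run the scrape every time you tweak your analysis. A simple option, assuming the df from the script above:

# Save the raw reviews to disk; reload them on subsequent runs.
df.to_csv("negative_reviews.csv", index=False)

# Later, skip the Actor call entirely:
# df = pd.read_csv("negative_reviews.csv")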

Step 4: Local Analysis with Pandas (The Payoff)
Just like that. No messing with headless browsers, no proxies, no parsing HTML. We now have a clean Pandas DataFrame.

Now for the fun part. Let's analyze it instantly.

Let's find out how many of these negative reviews mention "battery" or "connection" issues:

if 'reviewText' in df.columns and not df.empty:
    # Find reviews mentioning 'battery' or 'connection' issues
    keywords = ['battery', 'connection', 'disconnect', 'charge']

    # Build a regex pattern
    pattern = '|'.join(keywords)

    # Filter the DataFrame
    complaints_df = df[df['reviewText'].str.contains(pattern, case=False, na=False)]

    print(f"\nFound {len(complaints_df)} complaints out of {len(df)} total reviews mentioning: {keywords}")

    # Print a few examples
    for text in complaints_df['reviewText'].head(5):
        print(f"- {text[:150]}...") # Print a snippet
else:
    print("\n'reviewText' column not found or DataFrame is empty.")

Example Output:

Found 42 complaints out of 100 total reviews mentioning: ['battery', 'connection', 'disconnect', 'charge']

- The right earbud disconnects constantly. I've tried everything...
- Battery life is a joke, lasts maybe 2 hours instead of the 8 advertised...
- Love the sound, but the connection drops every 5 minutes...
- Won't hold a charge after 3 weeks...
- ...
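If you want to see which specific issue dominates, a per-keyword breakdown takes only a few more lines of Pandas (reusing the keywords list and the reviewText column from above):

# Count how many of the negative reviews mention each keyword individually.
keyword_counts = {
    kw: int(df['reviewText'].str.contains(kw, case=False, na=False).sum())
    for kw in keywords
}
print(keyword_counts)  # a dict mapping each keyword to its review count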

In a few lines of Pandas, we've zeroed in on this competitor's potential fatal flaw: connection and battery issues.

Conclusion
What did we just do?

We built a reproducible, reliable data pipeline in under 50 lines of Python. We completely skipped the fragile "scraper dev & maintenance" cycle.

By offloading the scraping task to a specialized API endpoint (the Actor we used today), we saved ourselves weeks of dev and maintenance time, allowing us to focus on what actually matters: analyzing the data.

You can find this Amazon Reviews Scraper with Advanced Filters on the Apify Store. Happy coding!
