How to Get Filtered Amazon Reviews into a Pandas DataFrame in Under 50 Lines of Python


We've all been there.

You need data from Amazon. You write a simple requests script. It works. Then... 403 Forbidden. CAPTCHA. IP Ban. You add proxies, User-Agent rotation. Next week, Amazon changes a CSS class, and your script breaks again.

The truth is: maintaining scrapers is worse than writing them.

In this tutorial, we're skipping all that pain. We'll take an API-first approach and let a specialized API (a pre-built Apify Actor) handle the scraping hell, while we focus on the fun part: analyzing the data with Python and Pandas.

Step 1: Set Up Your Environment
First, let's install the libraries. We'll use apify-client to call the API and pandas for data handling.

pip install apify-client pandas

You'll also need an Apify account (the free tier is fine) to get your API token. You can find it in your account settings under "Integrations."
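Once you've copied the token, export it as an environment variable rather than hard-coding it in your script. Here's a minimal sanity check (assuming you name the variable APIFY_TOKEN, which is what the script below expects):

# Run this once to confirm the token is visible to Python.
# In your shell first: export APIFY_TOKEN="apify_api_..."
import os

token = os.environ.get("APIFY_TOKEN")
print("Token found." if token else "APIFY_TOKEN is not set.")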

Step 2: Define Your "Advanced Filter" Task
We don't want all reviews. That's noise. We want the "verified purchase" 1- and 2-star reviews to find a competitor's fatal flaw.

To run this specific scrape, we'll call the Amazon Reviews Scraper with Advanced Filters. Its main advantage is that it accepts a detailed JSON input and applies the filtering on the server side.

Here's the JSON "payload" we'll send:

{
  "productAsins": ["B09JVCL7JR"], 
  "filterByStarRating": [1, 2],
  "filterByVerifiedPurchase": true,
  "minReviewLength": 50,
  "maxReviews": 100 
}

(I'm using a popular earbud ASIN as an example. maxReviews: 100 is good practice to keep our test run fast.)
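If you plan to run this for several competitor products, it can help to build the payload in Python rather than hand-editing JSON. Here's a small sketch; the helper name build_review_filter is my own invention, but the field names are exactly the ones from the payload above:

# Hypothetical helper that generates the Actor input for any list of ASINs.
def build_review_filter(asins, stars=(1, 2), min_length=50, max_reviews=100):
    return {
        "productAsins": list(asins),
        "filterByStarRating": list(stars),
        "filterByVerifiedPurchase": True,
        "minReviewLength": min_length,
        "maxReviews": max_reviews,
    }

# Same payload as the JSON above:
actor_input = build_review_filter(["B09JVCL7JR"])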

Step 3: Run the Actor & Fetch Data (The Python Script)
Now for the core code. We'll initialize the client, call the Actor, wait for it to finish, and pull the results into a list.

This code does all the heavy lifting:

import os
import pandas as pd
from apify_client import ApifyClient

# Get your token from an environment variable (recommended)
# Or just paste it: apify_client = ApifyClient("YOUR_TOKEN")
APIFY_TOKEN = os.environ.get("APIFY_TOKEN")

if not APIFY_TOKEN:
    raise Exception("Please set the 'APIFY_TOKEN' environment variable")

# 1. Initialize the client
apify_client = ApifyClient(APIFY_TOKEN)

print("Running the Actor...")

# 2. Define our input payload
actor_input = {
  "productAsins": ["B08N5HRT9B"], # Example ASIN
  "filterByStarRating": [1, 2],
  "filterByVerifiedPurchase": true,
  "minReviewLength": 50,
  "maxReviews": 100 
}

# 3. Call the Actor and wait for it to finish (.call() blocks until the run completes or wait_secs elapses)
run = apify_client.actor("delicious_zebu/amazon-reviews-scraper-with-advanced-filters").call(
    run_input=actor_input,
    wait_secs=120 # Wait a max of 2 minutes
)

print("Run finished. Fetching results...")

# 4. Get the results from the Actor's dataset
items = []
for item in apify_client.dataset(run["defaultDatasetId"]).iterate_items():
    items.append(item)

# 5. Load into a Pandas DataFrame
df = pd.DataFrame(items)

print(f"Successfully fetched {len(df)} reviews.")
print(df.head())
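One practical tip before we analyze: persist the raw results locally so you don't have to re-run the scrape every time you tweak your analysis. A simple option, assuming the df from the script above:

# Save the raw reviews to disk; reload them on subsequent runs.
df.to_csv("negative_reviews.csv", index=False)

# Later, skip the Actor call entirely:
# df = pd.read_csv("negative_reviews.csv")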

Step 4: Local Analysis with Pandas (The Payoff)
Just like that. No messing with headless browsers, no proxies, no parsing HTML. We now have a clean Pandas DataFrame.

Now for the fun part. Let's analyze it instantly.

Let's find out how many of these negative reviews mention "battery" or "connection" issues:

if 'reviewText' in df.columns and not df.empty:
    # Find reviews mentioning 'battery' or 'connection' issues
    keywords = ['battery', 'connection', 'disconnect', 'charge']

    # Build a regex pattern
    pattern = '|'.join(keywords)

    # Filter the DataFrame
    complaints_df = df[df['reviewText'].str.contains(pattern, case=False, na=False)]

    print(f"\nFound {len(complaints_df)} complaints out of {len(df)} total reviews mentioning: {keywords}")

    # Print a few examples
    for text in complaints_df['reviewText'].head(5):
        print(f"- {text[:150]}...") # Print a snippet
else:
    print("\n'reviewText' column not found or DataFrame is empty.")

Example Output:

Found 42 complaints out of 100 total reviews mentioning: ['battery', 'connection', 'disconnect', 'charge']

- The right earbud disconnects constantly. I've tried everything...
- Battery life is a joke, lasts maybe 2 hours instead of the 8 advertised...
- Love the sound, but the connection drops every 5 minutes...
- Won't hold a charge after 3 weeks...
- ...
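If you want to see which specific issue dominates, a per-keyword breakdown takes only a few more lines of Pandas (reusing the keywords list and the reviewText column from above):

# Count how many of the negative reviews mention each keyword individually.
keyword_counts = {
    kw: int(df['reviewText'].str.contains(kw, case=False, na=False).sum())
    for kw in keywords
}
print(keyword_counts)  # a dict mapping each keyword to its review count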

In a few lines of Pandas, we've zeroed in on this competitor's potential fatal flaw: connection and battery issues.

Conclusion
What did we just do?

We built a reproducible, reliable data pipeline in under 50 lines of Python. We completely skipped the fragile "scraper dev & maintenance" cycle.

By offloading the scraping task to a specialized API endpoint (the Actor we used today), we saved ourselves weeks of dev and maintenance time, allowing us to focus on what actually matters: analyzing the data.

You can find this Amazon Reviews Scraper with Advanced Filters on the Apify Store. Happy coding!
