You need data from Amazon. You write a simple requests script. It works. Then... 403 Forbidden. CAPTCHA. IP ban. You add proxies and User-Agent rotation. Next week, Amazon changes a CSS class, and your script breaks again.
The truth is: maintaining scrapers is worse than writing them.
In this tutorial, we're skipping all that pain with an API-first approach: we'll let a specialized API (a pre-built Apify Actor) handle the scraping hell, while we focus on the fun part of analyzing the data with Python and Pandas.
Step 1: Set Up Your Environment
First, let's install the libraries. We'll use apify-client to call the API and pandas for data handling.
pip install apify-client pandas
You'll also need an Apify account (the free tier is fine) to get your API token. You can find it in your account settings under "Integrations."
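Before moving on, it's worth a quick sanity check that your token works. Here's a minimal sketch using the apify-client package; user().get() returns your account info, and the exact fields (like username) may vary:

import os
from apify_client import ApifyClient

# If the token is invalid, this call raises an API error
client = ApifyClient(os.environ["APIFY_TOKEN"])
me = client.user().get()
print(f"Authenticated as: {me['username']}")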
Step 2: Define Your "Advanced Filter" Task
We don't want all reviews. That's noise. We want the "verified purchase" 1- and 2-star reviews to find a competitor's fatal flaw.
To run this specific scrape, we'll call the Amazon Reviews Scraper with Advanced Filters. Its key advantage is that it accepts a detailed JSON input, so the filtering happens server-side before any data reaches you.
Here's the JSON "payload" we'll send:
{
    "productAsins": ["B09JVCL7JR"],
    "filterByStarRating": [1, 2],
    "filterByVerifiedPurchase": true,
    "minReviewLength": 50,
    "maxReviews": 100
}
(I'm using a popular earbud ASIN as an example. maxReviews: 100 is good practice to keep our test run fast.)
Step 3: Run the Actor & Fetch Data (The Python Script)
Now for the core code. We'll initialize the client, call the Actor, wait for it to finish, and pull the results into a list.
This code does all the heavy lifting:
import os

import pandas as pd
from apify_client import ApifyClient

# Get your token from an environment variable (recommended),
# or just paste it: apify_client = ApifyClient("YOUR_TOKEN")
APIFY_TOKEN = os.environ.get("APIFY_TOKEN")
if not APIFY_TOKEN:
    raise RuntimeError("Please set the 'APIFY_TOKEN' environment variable")

# 1. Initialize the client
apify_client = ApifyClient(APIFY_TOKEN)

print("Running the Actor...")

# 2. Define our input payload (note: Python's True, not JSON's lowercase true)
actor_input = {
    "productAsins": ["B09JVCL7JR"],  # Same example ASIN as in Step 2
    "filterByStarRating": [1, 2],
    "filterByVerifiedPurchase": True,
    "minReviewLength": 50,
    "maxReviews": 100,
}

# 3. Call the Actor and wait for it to finish (call() blocks until the run ends)
run = apify_client.actor("delicious_zebu/amazon-reviews-scraper-with-advanced-filters").call(
    run_input=actor_input,
    wait_secs=120,  # Wait a max of 2 minutes
)

print("Run finished. Fetching results...")

# 4. Get the results from the Actor's dataset
items = []
for item in apify_client.dataset(run["defaultDatasetId"]).iterate_items():
    items.append(item)

# 5. Load into a Pandas DataFrame
df = pd.DataFrame(items)
print(f"Successfully fetched {len(df)} reviews.")
print(df.head())
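One habit worth adopting: persist the raw results to disk right away, so you can iterate on your analysis without re-running (and re-paying for) the scrape. A minimal sketch; the reviews.csv filename is just an example:

# Cache the raw results so re-running the analysis doesn't re-trigger the scrape
df.to_csv("reviews.csv", index=False)

# Later, reload without touching the Apify API at all:
# df = pd.read_csv("reviews.csv")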
Step 4: Local Analysis with Pandas (The Payoff)
Just like that. No messing with headless browsers, no proxies, no parsing HTML. We now have a clean Pandas DataFrame.
Now for the fun part: instant analysis. Let's find out how many of these negative reviews mention "battery" or "connection" issues:
if 'reviewText' in df.columns and not df.empty:
    # Find reviews mentioning 'battery' or 'connection' issues
    keywords = ['battery', 'connection', 'disconnect', 'charge']

    # Build a regex pattern that matches any of the keywords
    pattern = '|'.join(keywords)

    # Filter the DataFrame (case-insensitive; missing review text counts as no match)
    complaints_df = df[df['reviewText'].str.contains(pattern, case=False, na=False)]

    print(f"\nFound {len(complaints_df)} complaints out of {len(df)} total reviews mentioning: {keywords}")

    # Print a few examples
    for text in complaints_df['reviewText'].head(5):
        print(f"- {text[:150]}...")  # Print a snippet
else:
    print("\n'reviewText' column not found or DataFrame is empty.")
Example Output:
Found 42 complaints out of 100 total reviews mentioning: ['battery', 'connection', 'disconnect', 'charge']
- The right earbud disconnects constantly. I've tried everything...
- Battery life is a joke, lasts maybe 2 hours instead of the 8 advertised...
- Love the sound, but the connection drops every 5 minutes...
- Won't hold a charge after 3 weeks...
- ...
In a few lines of Pandas, we've zeroed in on this competitor's potential fatal flaw: connection and battery issues.
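Want to go one level deeper? Count how often each keyword appears to see which flaw dominates. A quick sketch building on the keywords list and df from the script above (it assumes the same reviewText column; adjust if the Actor's output schema differs):

# Break the complaints down by individual keyword
for kw in keywords:
    count = df['reviewText'].str.contains(kw, case=False, na=False).sum()
    print(f"{kw}: {count} reviews")

Note that a review mentioning both "battery" and "charge" is counted once per keyword, which is usually what you want for a quick frequency read.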
Conclusion
What did we just do?
We built a reproducible, reliable data pipeline in under 50 lines of Python. We completely skipped the fragile "scraper dev & maintenance" cycle.
By offloading the scraping task to a specialized API endpoint (the Actor we used today), we saved ourselves weeks of dev and maintenance time, allowing us to focus on what actually matters: analyzing the data.
You can find this Amazon Reviews Scraper with Advanced Filters on the Apify Store. Happy coding!
