Precious Lyna Anusiem
I Spent a Weekend Building Python Scrapers for 10 Websites. Here's Everything I Learned.

There's a shortcut at the end — but the lessons in between are genuinely worth reading.


Last spring, I needed eBay sold price data for about 200 items I was researching before an estate sale. Nothing complicated — just what did this thing actually sell for, recently, in used condition.

I figured it would take an hour. Maybe two.

Three hours later I had data on maybe 40 items, a stiff neck, and the particular kind of frustration that only comes from doing the same repetitive action several hundred times. Open tab. Search. Filter to sold listings. Scan prices. Write down number. Close tab. Repeat.

There had to be a better way.

So I did what any mildly annoyed person with a Python tutorial under their belt does: I automated it.

That one scraper turned into a weekend project. That weekend project turned into ten scrapers — covering Amazon, eBay, Indeed, LinkedIn, Zillow, Twitter, Reddit, Yelp, Google News, and a price tracker that emails me alerts when something drops below my target price.

Here's everything I learned along the way.


The Stack I Used (Deliberately Boring)

  • requests — HTTP calls
  • BeautifulSoup — HTML parsing
  • fake-useragent — rotating user-agent strings
  • tweepy and praw — for the two sites with decent free APIs (Twitter and Reddit)
  • csv — built-in, no install needed

No Selenium. No Playwright. No headless Chrome running in the background eating RAM. I made a deliberate choice to keep everything as simple as possible, and I'll explain why that was the right call.
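As a sketch of how these pieces fit together, here's a minimal header-rotation helper. The two User-Agent strings below are illustrative stand-ins; in the actual scripts, fake-useragent supplies a much larger, auto-updated pool.

```python
import random

# A tiny pool of real browser User-Agent strings. The fake-useragent
# library does the same job with a far larger, regularly refreshed pool.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def build_headers() -> dict:
    """Return request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

# Usage with requests:
#   resp = requests.get(url, headers=build_headers(), timeout=10)
```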


The Most Useful Thing I Learned: Look for JSON Before You Scrape HTML

There's a pattern that almost no scraping tutorial covers, and it saved me hours.

Modern websites built on React or Next.js often embed the initial page data as a JSON blob inside a <script> tag — Next.js puts it in a script tagged __NEXT_DATA__, and many sites additionally ship schema.org structured data in application/ld+json blocks. This is the data the frontend uses to render what you see. And it's sitting right there in the page source — structured, predictable, and usually more complete than what ends up in the rendered HTML.

import json
from bs4 import BeautifulSoup  # soup below is BeautifulSoup(html, "html.parser")

for script in soup.find_all("script", {"type": "application/ld+json"}):
    try:
        data = json.loads(script.string or "")
        # Your data is in here more often than you'd expect
    except json.JSONDecodeError:
        continue

Zillow uses this pattern heavily. So does Yelp. Parsing JSON is far more reliable than chasing CSS class names that change whenever someone on the frontend team refactors a component. I build JSON-first parsing with an HTML fallback into every scraper where the pattern is available.
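To make the JSON-first, HTML-fallback idea concrete, here's a minimal sketch against a toy page. The HTML, the price-tag class name, and the field layout are all made up for the example — real pages will differ.

```python
import json

from bs4 import BeautifulSoup

# Toy page standing in for a real product page (illustrative only).
html = """
<html><body>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": {"price": "19.99"}}
</script>
<span class="price-tag">$19.99</span>
</body></html>
"""

def extract_price(page_html: str):
    soup = BeautifulSoup(page_html, "html.parser")
    # 1. JSON-first: structured data survives frontend refactors.
    for script in soup.find_all("script", {"type": "application/ld+json"}):
        try:
            data = json.loads(script.string or "")
            if isinstance(data, dict) and "offers" in data:
                return float(data["offers"]["price"])
        except (json.JSONDecodeError, KeyError, ValueError):
            continue
    # 2. HTML fallback: fragile CSS selector, used only when no JSON exists.
    tag = soup.select_one(".price-tag")
    return float(tag.get_text().lstrip("$")) if tag else None

print(extract_price(html))  # 19.99
```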


The Delay Is Not Optional (And It Has to Be Random)

Every scraper I built has a randomized delay between requests. Not fixed — random. This distinction matters more than most people realize.

A fixed time.sleep(2) is a fingerprint. Every request spaced exactly two seconds apart is a signal. time.sleep(random.uniform(2, 5)) looks like a human who gets distracted between clicks.

I also found that slightly longer delays on the first request — simulating someone waiting for a page to load and actually reading it — helped on sites with more aggressive rate limiting.

Is any of this foolproof? No. But it's the difference between scraping 500 results cleanly and getting blocked on page three.
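As a sketch, the delay logic fits in one small helper. The 4–8 second first-request window is my own choice here, not a magic number — tune it to the site.

```python
import random

def pick_delay(first_request: bool = False) -> float:
    """Pick a human-looking random delay in seconds."""
    # Longer pause on the first request — simulating someone actually
    # waiting for the page to load and reading it.
    lo, hi = (4.0, 8.0) if first_request else (2.0, 5.0)
    return random.uniform(lo, hi)

# In the scraping loop:
#   for i, url in enumerate(urls):
#       time.sleep(pick_delay(first_request=(i == 0)))
#       resp = requests.get(url, headers=headers)
```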


Two Sites Have Good Free APIs — Use Them Instead of Scraping

Reddit and Twitter both offer free developer API access that most people don't bother with because they assume it's complicated or expensive. It's neither.

Reddit's API (via the praw library) is excellent. Upvote ratios, comment trees, post metadata — all clean, structured, and reliable. Setup takes about five minutes at reddit.com/prefs/apps.

Twitter's free tier is more restricted than it used to be, but for pulling public profile data and recent tweets it's still perfectly usable. The alternative — scraping Twitter's frontend — is a much harder problem. The API route is just better.

import praw

reddit = praw.Reddit(
    client_id="YOUR_ID",
    client_secret="YOUR_SECRET",
    user_agent="MyBot/1.0"
)

for post in reddit.subreddit("entrepreneur").hot(limit=50):
    print(post.title, post.score, post.upvote_ratio)

A dozen lines of code. Real data. No parsing, no fragile selectors.


The eBay Sold Listings Scraper Is the One People Don't Know They Need

This was the scraper that started everything, and it remains the one I use most regularly.

The key insight: eBay separates completed listings (things that actually sold) from active listings. Most people browse active listings and assume those prices reflect reality. They don't. The completed listings show you what real buyers actually paid — which is often meaningfully different.

The URL parameter that unlocks this: LH_Sold=1&LH_Complete=1. Once you have that, you're pulling real transaction data.
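Building the search URL is a one-liner with urllib. The _nkw (keywords) and _pgn (page) parameters are what eBay's search currently uses; treat the exact parameter set as subject to change.

```python
from urllib.parse import urlencode

def sold_listings_url(query: str, page: int = 1) -> str:
    """Build an eBay search URL filtered to completed, sold listings."""
    params = {
        "_nkw": query,      # search keywords
        "LH_Sold": 1,       # only items that actually sold
        "LH_Complete": 1,   # only completed listings
        "_pgn": page,       # results page number
    }
    return "https://www.ebay.com/sch/i.html?" + urlencode(params)

print(sold_listings_url("mechanical keyboard"))
```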

I also added automatic price statistics at the end of each run:

📈 PRICE SUMMARY (147 sold listings):
   Minimum:  $23.00
   Maximum:  $312.00
   Average:  $87.43
   Median:   $74.00

The median versus average comparison is more useful than it might seem. A few high-priced outlier sales can pull the average up significantly. Median tells you what you'll realistically see.
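A quick illustration with a hypothetical set of sold prices — Python's statistics module gives you both numbers in one import:

```python
from statistics import mean, median

# Hypothetical sold prices: mostly $23–$95, with one outlier sale at $312.
prices = [23.00, 45.50, 60.00, 74.00, 88.00, 95.00, 312.00]

print(f"Average: ${mean(prices):.2f}")   # $99.64 — pulled up by the outlier
print(f"Median:  ${median(prices):.2f}") # $74.00 — closer to a typical sale
```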


The Price Tracker Changed My Daily Behavior

Of the ten scripts, the price tracker is the one that earns its keep most consistently.

You give it a list of product URLs and target prices. It checks prices on a schedule, logs everything to CSV, and sends an email when something drops below your threshold. I run it every morning via a cron job:

0 9 * * * python /path/to/05_price_tracker.py

What surprised me was how quickly I stopped thinking about it. It just runs. It shows up in your inbox when something matters. That's the ideal version of automation — invisible until it's useful.

I've used it to catch drops on headphones, a camera lens that almost never goes on sale, and a mechanical keyboard I'd been watching for months. None of those would have happened if I'd had to remember to check manually.
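The core of the tracker is just a threshold check plus an email hook. A minimal sketch — the function names and SMTP setup here are illustrative, not the exact script:

```python
import smtplib
from email.message import EmailMessage

def find_deals(current_prices: dict, targets: dict) -> list:
    """Return (name, price, target) for every item at or below its target."""
    return [
        (name, price, targets[name])
        for name, price in current_prices.items()
        if name in targets and price <= targets[name]
    ]

def send_alert(deals, sender, recipient, smtp_host="localhost"):
    """Send one summary email covering all triggered alerts."""
    msg = EmailMessage()
    msg["Subject"] = f"Price alert: {len(deals)} item(s) below target"
    msg["From"], msg["To"] = sender, recipient
    msg.set_content("\n".join(
        f"{name}: ${price:.2f} (target ${target:.2f})"
        for name, price, target in deals
    ))
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)
```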


Yelp Is an Underrated Lead Generation Tool

I built the Yelp scraper mostly as an afterthought. It turned out to be one of the most immediately practical ones.

Pull every business in a category and city — with name, rating, review count, phone number, address, and website. Ten minutes of runtime gets you a spreadsheet that a data broker would charge hundreds of dollars for.

The angle I find most useful: filter for businesses with no website listed. Those are your warm prospects — established enough to have a Yelp presence, but clearly not prioritizing their online presence. If you sell any kind of digital service, that's your list.
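Filtering the output CSV for that list takes a few lines with the stdlib csv module. A sketch with fake rows standing in for real scraper output:

```python
import csv
import io

# Fake rows standing in for the Yelp scraper's CSV output.
sample_csv = """name,rating,review_count,phone,website
Joe's Plumbing,4.5,120,555-0101,
Ace Bakery,4.8,300,555-0102,acebakery.example.com
Quick Lube,4.0,85,555-0103,
"""

def leads_without_website(csv_text: str) -> list:
    """Return business names where the website column is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["name"] for row in reader if not row["website"].strip()]

print(leads_without_website(sample_csv))
```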


What I Packaged Up

After cleaning everything up, commenting every non-obvious line, and adding a CONFIGURE HERE block at the top of each script so non-developers can use them without reading the whole thing, I put the pack together.

Ten scripts total:

  1. Amazon product search results
  2. Indeed job listings
  3. Zillow for-sale and rental listings
  4. Twitter/X public profiles and recent tweets
  5. Cross-site price tracker with email alerts
  6. LinkedIn public job board (no login needed)
  7. eBay sold listings with automatic price statistics
  8. Google News for any topic
  9. Yelp business listings
  10. Reddit posts and comments

Every script outputs a clean CSV. Every one has random delays and rotating user-agents built in. Every one is commented well enough to modify without deep Python knowledge. There's a full README with setup instructions.

The full pack is $19 → anusiempreciouso.gumroad.com/l/Python_WebScraping_Templates_Pack

If any script doesn't work as described, I'll fix it or refund you — no back-and-forth.


One Honest Note on Legality

Web scraping public data sits in a legal grey area, and I'd rather address that directly than bury a disclaimer.

US courts have generally upheld the legality of scraping publicly visible data — the Ninth Circuit's hiQ v. LinkedIn ruling, which held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act, being the most significant example. But "generally legal" is not the same as universally fine, and each site has its own Terms of Service.

My personal guidelines: don't hammer servers (the built-in delays handle this), don't scrape behind authentication, don't republish scraped data as your own product. The scripts are built with these principles as defaults.


Going Further

A few things worth knowing if you want to extend any of these:

Proxies — For high-volume scraping, rotating residential proxies reduce blocking significantly. For the casual research use cases these scripts are built for, you probably don't need them.

Playwright/Selenium — For sites that render content entirely client-side via JavaScript, you'll eventually need a headless browser. Playwright is the current best option. None of the ten sites I targeted required this.

SQLite — CSV is fine for one-off pulls. If you're running scrapers on a schedule and want queryable historical data, sqlite3 is built into Python and requires no configuration.
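A minimal sketch of that upgrade path — swapping the CSV log for an SQLite table (the schema and names are illustrative):

```python
import sqlite3
from datetime import datetime, timezone

# ":memory:" for a throwaway demo; a real tracker would use a file
# path like "prices.db" so the history survives between runs.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS price_history (
        item    TEXT NOT NULL,
        price   REAL NOT NULL,
        checked TEXT NOT NULL
    )
""")

def log_price(item: str, price: float):
    """Append one observed price with a UTC timestamp."""
    conn.execute(
        "INSERT INTO price_history VALUES (?, ?, ?)",
        (item, price, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

# Queryable history, e.g. the lowest price ever seen per item:
#   SELECT item, MIN(price) FROM price_history GROUP BY item
```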

Questions on any of the techniques? Drop them in the comments.


Tags: python, webdev, automation, tutorial, beginners
