
PROXYCLAW

Scraping Zillow, Amazon, and Reddit with 3 Lines of Python (No Browser, No Blocks)

Zillow blocks scrapers. Amazon detects bots in milliseconds. Reddit rate-limits anything that isn't a human clicking slowly.

These are the three sites developers most want to scrape — and the three sites that are hardest to scrape reliably.

This post shows you how to scrape all three with 3 lines of Python each. No Playwright. No Selenium. No proxy rotation code. No CAPTCHA solving logic. Just clean data.

Why Scrapers Fail on These Sites

Zillow: JavaScript-rendered listings, Cloudflare protection, and aggressive bot fingerprinting. A raw requests.get() comes back 403 before you see a single listing.

Amazon: Product pages require residential IPs. Cloud IPs (AWS, GCP, Azure) are blacklisted. They also serve fake "bot detected" pages that look like real HTML but contain no product data.

Reddit: Rate-limits at the HTTP level, serves CAPTCHAs to anything that looks automated, and its official API now costs money for anything above toy usage.

The common thread: all three detect non-human traffic patterns and block them at the network level.
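One symptom worth catching early: a "successful" 200 response that is actually a block page. A quick heuristic check saves you from parsing garbage; this is a minimal sketch, and the marker strings are illustrative assumptions, not an exhaustive list:

```python
def looks_blocked(html: str) -> bool:
    """Heuristic: does this response look like a block/challenge page
    rather than real content? Markers are illustrative, not exhaustive."""
    markers = (
        "robot check",            # Amazon's interstitial title
        "captcha",
        "access denied",
        "checking your browser",  # classic Cloudflare challenge text
    )
    text = html.lower()
    return any(m in text for m in markers)

print(looks_blocked("<title>Robot Check</title>"))      # True
print(looks_blocked("<h1>Acme Wireless Earbuds</h1>"))  # False
```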

The Fix: Residential Proxies + Clean Output

ProxyClaw routes requests through real residential IPs (2M+ across 195 countries), handles CAPTCHA solving automatically, and returns clean Markdown instead of raw HTML.

pip install iploop-sdk

Get a free API key at proxyclaw.ai — 0.5GB free, no credit card.

Scraping Zillow

3 lines:

from iploop import ProxyClawClient

client = ProxyClawClient(api_key="your_api_key")
result = client.fetch("https://www.zillow.com/homes/for_sale/New-York_rb/", format="markdown")
print(result.content)

Output: Clean Markdown with listing addresses, prices, bed/bath counts, and square footage. No HTML parsing. No BeautifulSoup selectors that break every time Zillow redesigns their site.

Getting Structured Data

result = client.fetch(
    "https://www.zillow.com/homes/for_sale/New-York_rb/",
    format="json",
    extract="listings"
)

for listing in result.data["listings"]:
    print(f"{listing['address']} — ${listing['price']:,} · {listing['beds']}bd/{listing['baths']}ba")
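Once the data is structured, downstream filtering is plain Python. A sketch assuming the listing shape from the loop above (the keys `address`, `price`, `beds`, `baths` are assumptions based on that loop, not a documented schema):

```python
def affordable(listings: list[dict], max_price: int, min_beds: int = 2) -> list[dict]:
    """Filter structured listings by budget and bedroom count."""
    return [
        l for l in listings
        if l["price"] <= max_price and l["beds"] >= min_beds
    ]

sample = [
    {"address": "12 Oak Ave", "price": 650_000, "beds": 2, "baths": 1},
    {"address": "9 Elm St", "price": 1_200_000, "beds": 3, "baths": 2},
    {"address": "4 Pine Rd", "price": 580_000, "beds": 1, "baths": 1},
]
print(affordable(sample, max_price=700_000))  # only 12 Oak Ave qualifies
```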

Scraping Amazon

Product pages, reviews, pricing — 3 lines:

from iploop import ProxyClawClient

client = ProxyClawClient(api_key="your_api_key")
result = client.fetch("https://www.amazon.com/dp/B09X7CRKRZ", format="markdown")
print(result.content)

You get the product title, description, features, pricing, and top reviews in clean Markdown. No fake "Robot Check" pages. No blank HTML shells.

Price Monitoring Example

import json
from iploop import ProxyClawClient

client = ProxyClawClient(api_key="your_api_key")

def get_amazon_price(asin: str) -> dict:
    result = client.fetch(
        f"https://www.amazon.com/dp/{asin}",
        format="json",
        extract="product"
    )
    return {
        "asin": asin,
        "title": result.data["title"],
        "price": result.data["price"],
        "rating": result.data["rating"],
        "review_count": result.data["review_count"]
    }

asins = ["B09X7CRKRZ", "B08N5WRWNW", "B07ZPKN6YR"]
prices = [get_amazon_price(asin) for asin in asins]
print(json.dumps(prices, indent=2))

Run this as a cron job, store to a database, and you have a price tracker with zero browser overhead.
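The "store to a database" half of that cron job is a few lines of SQLite. A minimal sketch; the table name and columns are my own, not part of any SDK:

```python
import sqlite3
import time

def record_prices(con: sqlite3.Connection, snapshots: list[dict]) -> None:
    """Append one timestamped row per product snapshot."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS prices "
        "(asin TEXT, price REAL, fetched_at INTEGER)"
    )
    now = int(time.time())
    con.executemany(
        "INSERT INTO prices VALUES (?, ?, ?)",
        [(s["asin"], s["price"], now) for s in snapshots],
    )
    con.commit()

# In the cron job you'd connect to a real file, e.g. sqlite3.connect("prices.db")
con = sqlite3.connect(":memory:")
record_prices(con, [{"asin": "B09X7CRKRZ", "price": 29.99}])
```

Querying `MIN(price)` per ASIN over time then gives you price-drop alerts for free.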

Scraping Reddit

Threads, comments, upvotes — without the API costs:

from iploop import ProxyClawClient

client = ProxyClawClient(api_key="your_api_key")
result = client.fetch("https://www.reddit.com/r/MachineLearning/top/?t=week", format="markdown")
print(result.content)

Pulling Thread Comments

result = client.fetch(
    "https://www.reddit.com/r/python/comments/1abc123/some_thread/",
    format="json",
    extract="comments"
)

for comment in result.data["comments"][:10]:
    print(f"[{comment['score']} pts] {comment['author']}: {comment['body'][:200]}")

Useful for sentiment analysis, trend detection, or feeding community discussion into an LLM pipeline.
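For the sentiment use case, even a crude keyword tally over comment bodies is a useful first pass before reaching for an LLM. A sketch assuming the comment shape from the loop above; the word lists are placeholders, not a real sentiment lexicon:

```python
import re
from collections import Counter

POSITIVE = {"great", "love", "useful", "works"}
NEGATIVE = {"broken", "hate", "slow", "buggy"}

def sentiment_tally(comments: list[dict]) -> Counter:
    """Count positive/negative keyword hits across comment bodies."""
    tally = Counter()
    for c in comments:
        for word in re.findall(r"[a-z']+", c["body"].lower()):
            if word in POSITIVE:
                tally["positive"] += 1
            elif word in NEGATIVE:
                tally["negative"] += 1
    return tally

sample = [
    {"author": "u1", "score": 42, "body": "Love this library, works great"},
    {"author": "u2", "score": 3, "body": "Still buggy and slow for me"},
]
print(sentiment_tally(sample))  # Counter({'positive': 3, 'negative': 2})
```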

Combining All Three: A Market Research Agent

from iploop import ProxyClawClient
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

client = ProxyClawClient(api_key="your_api_key")
llm = ChatOpenAI(model="gpt-4o")

def research_market(product_query: str, amazon_asin: str):
    amazon = client.fetch(f"https://www.amazon.com/dp/{amazon_asin}", format="markdown")
    reddit = client.fetch(
        f"https://www.reddit.com/search/?q={product_query.replace(' ', '+')}&sort=top",
        format="markdown"
    )
    context = f"## Amazon\n{amazon.content[:2000]}\n\n## Reddit\n{reddit.content[:2000]}"
    response = llm.invoke([HumanMessage(content=
        f"Summarize the market for '{product_query}': {context}"
    )])
    return response.content

print(research_market("wireless earbuds", "B09X7CRKRZ"))

Three data sources. One LLM call. Zero blocked requests.

Performance Notes

ProxyClaw adds ~1-3 seconds of latency per request. For price monitoring, market research, and data pipelines, this is completely acceptable.
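At multi-second latencies, occasional timeouts are still possible, so it is worth wrapping fetches in a retry with exponential backoff. This is a generic wrapper you would write yourself, not a ProxyClaw feature:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(); on exception, retry with exponential backoff (1s, 2s, 4s, ...)."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * 2 ** i)

# Usage with the client from earlier would look like:
#   result = with_retries(lambda: client.fetch(url, format="markdown"))
```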

For high-frequency scraping, use the batch API:

urls = [
    "https://amazon.com/dp/B09X7CRKRZ",
    "https://amazon.com/dp/B08N5WRWNW",
    "https://zillow.com/homedetails/123-main-st/12345_zpid/"
]

results = client.fetch_batch(urls, format="markdown")
for r in results:
    print(r.url, "->", len(r.content), "chars")

Free Tier

0.5GB free per month. No credit card. Roughly 5,000–10,000 page fetches — enough to build and test a full scraping pipeline.
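The 5,000–10,000 figure follows directly from typical response sizes. Quick arithmetic, with the 50–100 KB per-page weights as rough assumptions:

```python
free_bytes = 0.5 * 1024**3           # 0.5 GB free tier
for page_kb in (50, 100):            # rough Markdown response sizes
    fetches = free_bytes / (page_kb * 1024)
    print(f"{page_kb} KB/page -> about {fetches:,.0f} fetches")
```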

Paid plans: $1.50/GB vs BrightData at $8–15/GB.

Sign up: proxyclaw.ai

What's Next

Wire it into a full agent pipeline — ProxyClaw + LangChain memory + structured output parsers = agents that can autonomously research, monitor, and report on anything on the web.

GitHub: github.com/Iploop/proxyclaw
Docs: iploop.io/docs
