Zillow blocks scrapers. Amazon detects bots in milliseconds. Reddit rate-limits anything that isn't a human clicking slowly.
These are the three sites developers most want to scrape — and the three sites that are hardest to scrape reliably.
This post shows you how to scrape all three with 3 lines of Python each. No Playwright. No Selenium. No proxy rotation code. No CAPTCHA solving logic. Just clean data.
Why Scrapers Fail on These Sites
Zillow: JavaScript-rendered listings, Cloudflare protection, and aggressive bot fingerprinting. A raw requests.get() returns a 403 before it even loads the page.
Amazon: Product pages require residential IPs. Cloud IPs (AWS, GCP, Azure) are blacklisted. They also serve fake "bot detected" pages that look like real HTML but contain no product data.
Reddit: Rate-limits at the HTTP level, serves CAPTCHAs to anything that looks automated, and its official API now costs money for anything beyond toy usage.
The common thread: all three detect non-human traffic patterns and block them at the network level.
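One subtle failure mode worth guarding against: a 200 response is not proof you got real data, since some of these sites return a decoy page instead of an error. A quick sanity check helps; the marker strings below are illustrative assumptions (block pages change often), not an exhaustive list.

```python
# Hypothetical block-page detector. The markers are illustrative; sites
# rotate the wording of their interstitials, so treat this as a heuristic.
BLOCK_MARKERS = [
    "robot check",      # Amazon's classic interstitial title
    "captcha",
    "access denied",
    "unusual traffic",
]

def looks_blocked(html: str) -> bool:
    """Return True if the page body resembles a bot-detection page."""
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)

print(looks_blocked("<title>Robot Check</title>"))        # True
print(looks_blocked("<h1>Wireless Earbuds</h1>"))         # False
```

Run it on every raw fetch before parsing, and you stop silently ingesting empty decoy pages.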
The Fix: Residential Proxies + Clean Output
ProxyClaw routes requests through real residential IPs (2M+ across 195 countries), handles CAPTCHA solving automatically, and returns clean Markdown instead of raw HTML.
```bash
pip install iploop-sdk
```
Get a free API key at proxyclaw.ai — 0.5GB free, no credit card.
Scraping Zillow
3 lines:
```python
from iploop import ProxyClawClient

client = ProxyClawClient(api_key="your_api_key")
result = client.fetch("https://www.zillow.com/homes/for_sale/New-York_rb/", format="markdown")
print(result.content)
```
Output: Clean Markdown with listing addresses, prices, bed/bath counts, and square footage. No HTML parsing. No BeautifulSoup selectors that break every time Zillow redesigns their site.
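If you do want fields out of the Markdown without waiting for the JSON mode below, a small regex pass is enough. The listing-line layout here is an assumption for illustration (it mirrors the `address — price — beds/baths` shape used elsewhere in this post); adjust the pattern to whatever the actual output looks like.

```python
import re

# Assumed line shape, for illustration only:
#   - 123 Main St, Brooklyn, NY — $725,000 — 2bd/1ba
LISTING_RE = re.compile(
    r"^- (?P<address>.+?) — \$(?P<price>[\d,]+) — (?P<beds>\d+)bd/(?P<baths>\d+)ba$"
)

def parse_listings(markdown: str) -> list[dict]:
    """Pull structured listing dicts out of Markdown bullet lines."""
    listings = []
    for line in markdown.splitlines():
        m = LISTING_RE.match(line.strip())
        if m:
            listings.append({
                "address": m.group("address"),
                "price": int(m.group("price").replace(",", "")),
                "beds": int(m.group("beds")),
                "baths": int(m.group("baths")),
            })
    return listings

sample = "- 123 Main St, Brooklyn, NY — $725,000 — 2bd/1ba"
print(parse_listings(sample))
```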
Getting Structured Data
```python
result = client.fetch(
    "https://www.zillow.com/homes/for_sale/New-York_rb/",
    format="json",
    extract="listings"
)

for listing in result.data["listings"]:
    print(f"{listing['address']} — ${listing['price']:,} — {listing['beds']}bd/{listing['baths']}ba")
```
Scraping Amazon
Product pages, reviews, pricing — 3 lines:
```python
from iploop import ProxyClawClient

client = ProxyClawClient(api_key="your_api_key")
result = client.fetch("https://www.amazon.com/dp/B09X7CRKRZ", format="markdown")
print(result.content)
```
You get the product title, description, features, pricing, and top reviews in clean Markdown. No fake "Robot Check" pages. No blank HTML shells.
Price Monitoring Example
```python
import json

from iploop import ProxyClawClient

client = ProxyClawClient(api_key="your_api_key")

def get_amazon_price(asin: str) -> dict:
    result = client.fetch(
        f"https://www.amazon.com/dp/{asin}",
        format="json",
        extract="product"
    )
    return {
        "asin": asin,
        "title": result.data["title"],
        "price": result.data["price"],
        "rating": result.data["rating"],
        "review_count": result.data["review_count"]
    }

asins = ["B09X7CRKRZ", "B08N5WRWNW", "B07ZPKN6YR"]
prices = [get_amazon_price(asin) for asin in asins]
print(json.dumps(prices, indent=2))
```
Run this as a cron job, store to a database, and you have a price tracker with zero browser overhead.
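The "store and compare" step of that cron job can be pure Python. A minimal sketch, assuming the dicts produced by `get_amazon_price()` above and leaving the database layer out of the diff:

```python
# Compare two price snapshots and flag meaningful moves.
# `threshold` is the fractional change that counts as an alert (5% default).
def price_changes(previous: list[dict], current: list[dict],
                  threshold: float = 0.05) -> list[dict]:
    old = {item["asin"]: item["price"] for item in previous}
    alerts = []
    for item in current:
        before = old.get(item["asin"])
        if before and abs(item["price"] - before) / before > threshold:
            alerts.append({"asin": item["asin"], "old": before, "new": item["price"]})
    return alerts

yesterday = [{"asin": "B09X7CRKRZ", "price": 99.99}]
today = [{"asin": "B09X7CRKRZ", "price": 79.99}]
print(price_changes(yesterday, today))
# [{'asin': 'B09X7CRKRZ', 'old': 99.99, 'new': 79.99}]
```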
Scraping Reddit
Threads, comments, upvotes — without the API costs:
```python
from iploop import ProxyClawClient

client = ProxyClawClient(api_key="your_api_key")
result = client.fetch("https://www.reddit.com/r/MachineLearning/top/?t=week", format="markdown")
print(result.content)
```
Pulling Thread Comments
```python
result = client.fetch(
    "https://www.reddit.com/r/python/comments/1abc123/some_thread/",
    format="json",
    extract="comments"
)

for comment in result.data["comments"][:10]:
    print(f"[{comment['score']} pts] {comment['author']}: {comment['body'][:200]}")
```
Useful for sentiment analysis, trend detection, or feeding community discussion into an LLM pipeline.
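As one sketch of the trend-detection idea: weight each comment's words by its score, so heavily upvoted opinions dominate the tally. The comment dicts follow the fields shown in the loop above; the stopword list is a deliberately tiny assumption you would replace in practice.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "it", "to", "and", "of", "in", "i"}

def trending_terms(comments: list[dict], top_n: int = 5) -> list[tuple[str, int]]:
    """Rank words across comments, weighted by each comment's score."""
    counts = Counter()
    for comment in comments:
        words = [w.strip(".,!?").lower() for w in comment["body"].split()]
        for word in words:
            if word and word not in STOPWORDS:
                counts[word] += max(comment["score"], 1)  # floor at 1 per mention
    return counts.most_common(top_n)

sample = [
    {"score": 42, "body": "PyTorch compile times are much better now"},
    {"score": 7, "body": "compile times still hurt on large models"},
]
print(trending_terms(sample, top_n=2))
# [('compile', 49), ('times', 49)]
```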
Combining All Three: A Market Research Agent
```python
from iploop import ProxyClawClient
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

client = ProxyClawClient(api_key="your_api_key")
llm = ChatOpenAI(model="gpt-4o")

def research_market(product_query: str, amazon_asin: str):
    amazon = client.fetch(f"https://www.amazon.com/dp/{amazon_asin}", format="markdown")
    reddit = client.fetch(
        f"https://www.reddit.com/search/?q={product_query.replace(' ', '+')}&sort=top",
        format="markdown"
    )
    context = f"## Amazon\n{amazon.content[:2000]}\n\n## Reddit\n{reddit.content[:2000]}"
    response = llm.invoke([HumanMessage(content=
        f"Summarize the market for '{product_query}': {context}"
    )])
    return response.content

print(research_market("wireless earbuds", "B09X7CRKRZ"))
```
Three data sources. One LLM call. Zero blocked requests.
Performance Notes
ProxyClaw adds ~1-3 seconds of latency per request. For price monitoring, market research, and data pipelines, this is completely acceptable.
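With multi-second requests, the occasional transient failure is worth absorbing rather than crashing a pipeline over. A generic retry-with-backoff sketch; the exception type a failed `client.fetch()` raises is an assumption here, so this catches broadly and re-raises the last error.

```python
import time

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Call `fetch(url)`, retrying with exponential backoff on failure."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:  # assumption: no dedicated error class known
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s...
    raise last_error

# Usage with the client from above:
# result = fetch_with_retry(lambda u: client.fetch(u, format="markdown"), url)
```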
For high-frequency scraping, use the batch API:
```python
urls = [
    "https://amazon.com/dp/B09X7CRKRZ",
    "https://amazon.com/dp/B08N5WRWNW",
    "https://zillow.com/homedetails/123-main-st/12345_zpid/"
]

results = client.fetch_batch(urls, format="markdown")
for r in results:
    print(r.url, "—", len(r.content), "chars")
```
Free Tier
0.5GB free per month. No credit card. Roughly 5,000–10,000 page fetches — enough to build and test a full scraping pipeline.
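The fetch estimate follows from an assumed average page payload of 50-100 KB, which you can sanity-check in two lines:

```python
# Back-of-the-envelope quota math: 0.5 GB split across 50-100 KB pages.
# The per-page sizes are assumptions implied by the 5,000-10,000 estimate.
QUOTA_BYTES = 0.5 * 1024**3

for page_kb in (50, 100):
    fetches = int(QUOTA_BYTES / (page_kb * 1024))
    print(f"{page_kb} KB/page -> ~{fetches:,} fetches")
```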
Paid plans: $1.50/GB vs BrightData at $8–15/GB.
Sign up: proxyclaw.ai
What's Next
Wire it into a full agent pipeline — ProxyClaw + LangChain memory + structured output parsers = agents that can autonomously research, monitor, and report on anything on the web.
GitHub: github.com/Iploop/proxyclaw
Docs: iploop.io/docs