You've built a LangChain agent. It can reason. It can use tools. It can call APIs.
Then you point it at the web — and it immediately breaks.
CAPTCHAs. IP bans. Cloudflare blocks. JavaScript-heavy pages that return blank HTML. The web was not built for agents, and it actively fights back.
This post shows you how to fix that — using ProxyClaw as a drop-in web access layer for your LangChain agent.
The Problem in One Line
When your LangChain agent tries to fetch a URL, it's making a raw HTTP request from a cloud IP. Every anti-bot system on earth knows that pattern and blocks it instantly.
ProxyClaw routes those requests through 2M+ real residential IPs, solves CAPTCHAs automatically, and returns clean Markdown instead of raw HTML — exactly what an LLM needs.
Install
pip install langchain-proxyclaw
Or if you're using the core SDK:
pip install iploop-sdk
3-Line Integration
Here's the minimal setup:
from langchain_proxyclaw import ProxyClawLoader

loader = ProxyClawLoader(url="https://amazon.com/dp/B09X7CRKRZ")
docs = loader.load()
print(docs[0].page_content)
That's it. ProxyClaw handles the proxy rotation, CAPTCHA solving, and Markdown conversion. You get clean text back, ready to feed into your chain.
Building a Real Agent
Let's build a LangChain agent that can browse any URL and answer questions about it.
from langchain.agents import initialize_agent, AgentType
from langchain.tools import Tool
from langchain_openai import ChatOpenAI
from iploop import ProxyClawClient

client = ProxyClawClient(api_key="your_api_key")  # Free tier: proxyclaw.ai

def browse_url(url: str) -> str:
    """Fetch a URL and return clean Markdown content."""
    result = client.fetch(url, format="markdown")
    return result.content[:4000]  # trim for context window

browse_tool = Tool(
    name="BrowseURL",
    func=browse_url,
    description=(
        "Fetches a URL and returns the page content as clean Markdown. "
        "Use this to read web pages, product listings, documentation, "
        "or any public URL."
    ),
)

llm = ChatOpenAI(model="gpt-4o", temperature=0)

agent = initialize_agent(
    tools=[browse_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

result = agent.run("What are the top reviews for https://amazon.com/dp/B09X7CRKRZ?")
print(result)
The agent now has real web access. It can read product pages, scrape pricing, pull documentation, or gather competitive intel — without getting blocked.
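One small refinement: the hard 4,000-character slice in `browse_url` can cut a page mid-sentence. A friendlier variant trims at the last paragraph break before the limit — this helper is illustrative, not part of the SDK:

```python
def trim_to_paragraph(text: str, limit: int = 4000) -> str:
    """Trim text to at most `limit` chars, preferring a paragraph boundary."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    boundary = cut.rfind("\n\n")  # last blank line within the limit
    # Fall back to a hard cut if no paragraph break exists in range
    return cut[:boundary] if boundary > 0 else cut
```

Swap `result.content[:4000]` for `trim_to_paragraph(result.content)` and the agent sees whole paragraphs instead of truncated ones.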
Why Markdown Instead of HTML?
Raw HTML is toxic for LLMs. A typical product page is 90% boilerplate — nav bars, scripts, ads, footer links. An LLM wastes context tokens on noise and misses the signal.
ProxyClaw strips all of that and returns just the content: headings, paragraphs, lists, prices, reviews. The LLM gets clean signal, not garbage.
# Raw HTML approach — bad
import requests
html = requests.get("https://example.com").text
# Result: 50KB of scripts, nav, ads, and maybe 500 words of actual content
# ProxyClaw approach — good
result = client.fetch("https://example.com", format="markdown")
# Result: 800 words of clean, structured Markdown
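You can see the markup-to-content ratio yourself with nothing but the standard library. This sketch strips tags from a small hand-written HTML sample and compares sizes — the sample page is made up, but the ratio is typical:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only visible text, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

html = (
    "<html><head><script>var analytics = {/* tracking */};</script></head>"
    "<body><nav><a href='/'>Home</a><a href='/shop'>Shop</a></nav>"
    "<h1>Widget Pro</h1><p>The Widget Pro does one thing well.</p>"
    "</body></html>"
)
parser = TextExtractor()
parser.feed(html)
text = "\n".join(parser.parts)
print(len(html), len(text))  # the markup dwarfs the actual content
```

Even in this toy page, most of the bytes are tags and scripts — on a real product page the ratio is far worse.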
Handling Dynamic Sites
Some sites only render content via JavaScript — React, Vue, Next.js apps that return an empty shell on first load. ProxyClaw handles this automatically with headless rendering.
result = client.fetch(
    url="https://some-spa-app.com/products",
    format="markdown",
    render_js=True,  # enables headless browser rendering
)
print(result.content)
No Playwright setup. No Selenium. No managing browser instances in your agent infrastructure. ProxyClaw does it remotely.
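Headless rendering costs more per request, so a common pattern is to try a plain fetch first and retry with `render_js=True` only when the page comes back nearly empty. A sketch with the fetch call injected as a parameter — the 200-character threshold is an arbitrary assumption:

```python
def fetch_with_js_fallback(fetch, url: str, min_chars: int = 200) -> str:
    """Fetch a URL; retry with JS rendering if the result looks like an empty shell."""
    content = fetch(url, render_js=False)
    if len(content.strip()) >= min_chars:
        return content
    # Likely a client-rendered shell -- retry with a headless browser
    return fetch(url, render_js=True)
```

In practice `fetch` would wrap `client.fetch(url, format="markdown", render_js=...)` and return its `.content`.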
Free Tier
ProxyClaw has a free tier: 0.5GB per month, no credit card required.
That's enough for thousands of page fetches in development. Paid plans start at $1.50/GB — compare that to BrightData at $8-15/GB.
Sign up at proxyclaw.ai and get your API key in 30 seconds.
What's Next
In the next article: scraping Zillow, Amazon, and Reddit with 3 lines of Python — no browser, no proxy management, no blocked requests.
GitHub: github.com/Iploop/proxyclaw
Docs: iploop.io/docs