You're building a RAG application. Your LLM needs fresh information from the web. ChatGPT has web browsing built in, but your custom pipeline doesn't.
Options:
- Google Custom Search API — $5/1,000 queries, returns URLs only (not content)
- SerpAPI — $50/month, still just URLs
- Scrape it yourself — build HTTP client, handle JS rendering, parse HTML, extract content...
- Use a purpose-built tool — this is what I built
The Solution: RAG Web Browser
I built a web browser specifically for RAG pipelines. It searches Google, scrapes the results, and returns clean Markdown.
How It Works
Search Query -> Google SERP -> Top N URLs -> Fetch HTML -> Readability -> Markdown
- You send a search query ("best python frameworks 2026")
- It queries Google through a SERP proxy
- It fetches the top N result pages
- Extracts main content using Mozilla Readability (same as Firefox Reader View)
- Converts to Markdown — compact, LLM-friendly, preserves structure
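To make the last two steps concrete, here is a minimal sketch of "strip the boilerplate, keep the structure, emit Markdown". It is not the actor's code and it does not use Mozilla Readability: it is a toy stand-in built on Python's standard-library `html.parser`, and the names `MarkdownSketch` and `html_to_markdown` are made up for this illustration. It only handles headings, paragraphs, and list items.

```python
from html.parser import HTMLParser

# Elements treated as page chrome and dropped entirely (an assumption for
# this sketch; Readability uses much richer scoring heuristics).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = set(HEADING_PREFIX) | {"p", "li"}


class MarkdownSketch(HTMLParser):
    """Toy stand-in for the extract + convert steps (not Readability)."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a boilerplate element
        self.prefix = None    # Markdown prefix of the open block, if any
        self.buffer = []      # text fragments of the open block
        self.blocks = []      # finished Markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.prefix = HEADING_PREFIX.get(tag, "- " if tag == "li" else "")
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS and self.prefix is not None:
            # Collapse whitespace runs so the output stays compact
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.prefix = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    """Convert a small subset of HTML to Markdown, skipping page chrome."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Even this tiny version shows why the output suits LLMs: navigation and footers disappear, while headings and lists survive as plain text markers.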
Python Example
from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("your-api-token")

# Run the actor and wait for it to finish
run = client.actor("tugelbay/rag-web-browser").call(
    run_input={
        "query": "retrieval augmented generation best practices",
        "maxResults": 3,
        "outputFormat": "markdown",
    }
)

# Each dataset item is one scraped page
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"## {item['title']}")
    print(f"Content: {item['markdown'][:200]}...")
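Once you have the items, the usual next step is to pack them into a prompt. The sketch below is one way to do that under a character budget. It assumes each item carries the `title`, `url`, and `markdown` fields used in this post; `build_context` and `max_chars` are hypothetical names for this example, not part of the actor or the Apify client.

```python
SEPARATOR = "\n\n"


def build_context(items, max_chars=4000):
    """Join scraped pages into one context string of at most max_chars."""
    context = ""
    for item in items:
        body = (item.get("markdown") or "").strip()
        if not body:
            continue  # skip pages where extraction returned nothing
        block = f"Source: {item.get('title', 'untitled')} <{item.get('url', '')}>\n{body}"
        if context:
            block = SEPARATOR + block
        room = max_chars - len(context)
        if room <= 0:
            break
        # Truncate the last block so the budget is never exceeded
        context += block[:room]
    return context
```

Keeping the source line next to each page lets the model cite where an answer came from. A character budget is a crude proxy for tokens, so swap in a real tokenizer if you need tight limits.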
LangChain Integration
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper(apify_api_token="your-token")

# call_actor() returns a dataset loader, not a list of documents
loader = apify.call_actor(
    actor_id="tugelbay/rag-web-browser",
    run_input={"query": "RAG best practices", "maxResults": 5},
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("markdown", ""),
        metadata={"url": item["url"], "title": item["title"]},
    ),
)
docs = loader.load()

# Feed into your vector store (vector_store is assumed to be an
# already-initialized LangChain vector store)
vector_store.add_documents(docs)
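Full pages are often too long to embed as a single vector, so it usually helps to split the Markdown first. Because the output keeps its headings, a heading-based split is a natural fit. This is a rough sketch in plain Python, not something the actor or LangChain does for you; `split_markdown_by_heading` and `max_chars` are names invented for this example.

```python
def split_markdown_by_heading(markdown: str, max_chars: int = 1500) -> list[str]:
    """Split Markdown into chunks at headings, capping chunk size.

    Limitation of this sketch: any line starting with '#' counts as a
    heading, including comment lines inside code blocks.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        starts_section = line.startswith("#")
        current_len = sum(len(existing) + 1 for existing in current)
        too_long = current_len + len(line) > max_chars
        # Close the running chunk at a new heading or when it would overflow
        if current and (starts_section or too_long):
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

Each chunk can then be wrapped in its own document with the page's `url` and `title` as metadata before it goes into the vector store.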
Why Not Just Use BeautifulSoup?
You could. But:
- Google blocks scrapers — you need a SERP proxy
- JS-heavy pages — many sites need a browser to render
- Content extraction — raw HTML has ads, nav, footers. Readability strips all that
- Maintenance — Google changes their HTML, sites change structure. Someone has to maintain the selectors
This actor handles all of that. You just send a query and get Markdown back.
Cost
- Pay per page (billed through Apify's pay-per-event, or PPE, model): you pay only for pages actually scraped
- First 100 pages free
- ~$0.01 per page after that
- Compare: apify/rag-web-browser charges $20/month flat (rental)
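To see where per-page pricing beats a flat fee, here is the arithmetic as a small sketch. It uses the figures quoted above (100 free pages, roughly $0.01 per page, $20/month flat) and assumes, which the list above does not state, that the free pages reset every month. The function names are made up for this example.

```python
FREE_PAGES = 100            # free allowance (assumed to reset monthly)
PRICE_PER_PAGE_USD = 0.01   # approximate per-page price quoted above
FLAT_RENTAL_USD = 20.0      # flat monthly fee used for comparison


def monthly_cost(pages: int) -> float:
    """Estimated monthly cost in USD for a given number of scraped pages."""
    billable = max(0, pages - FREE_PAGES)
    return round(billable * PRICE_PER_PAGE_USD, 2)


def break_even_pages() -> int:
    """Pages per month at which per-page pricing equals the flat fee."""
    return FREE_PAGES + round(FLAT_RENTAL_USD / PRICE_PER_PAGE_USD)
```

Under these assumptions, 600 pages a month comes to about $5, and the two models cost the same at 2,100 pages a month. Below that volume, paying per page is cheaper.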
When To Use This
- RAG pipelines — feed fresh web content into vector databases
- AI agents — give your agent web browsing capability via MCP
- Research automation — search a topic, get structured content
- Content monitoring — track page changes on a schedule
Links
- Actor: RAG Web Browser on Apify
- Docs: Full API reference on the actor page
- MCP compatible: Works with Claude Desktop, GPTs, LangChain, CrewAI
Questions? Drop a comment — happy to discuss RAG architecture and web scraping strategies.