DEV Community

Tugelbay Konabayev

How to Add Web Browsing to Your RAG Pipeline

You're building a RAG application. Your LLM needs fresh information from the web. ChatGPT has web browsing built in, but your custom pipeline doesn't.

Options:

  1. Google Custom Search API — $5/1,000 queries, returns URLs only (not content)
  2. SerpAPI — $50/month, still just URLs
  3. Scrape it yourself — build HTTP client, handle JS rendering, parse HTML, extract content...
  4. Use a purpose-built tool — this is what I built

The Solution: RAG Web Browser

I built a web browser specifically for RAG pipelines. It searches Google, scrapes the results, and returns clean Markdown.

How It Works

```
Search Query -> Google SERP -> Top N URLs -> Fetch HTML -> Readability -> Markdown
```
  1. You send a search query ("best python frameworks 2026")
  2. It queries Google via SERP proxy
  3. Fetches the top N result pages
  4. Extracts main content using Mozilla Readability (same as Firefox Reader View)
  5. Converts to Markdown — compact, LLM-friendly, preserves structure
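
Steps 4 and 5 can be illustrated with a toy, standard-library-only sketch. The real actor uses Mozilla Readability, which scores nodes by text density; this hand-rolled parser just shows the core idea of skipping page chrome and emitting Markdown:

```python
from html.parser import HTMLParser

class MarkdownExtractor(HTMLParser):
    """Toy content extractor: emits <h1>-<h3> and body text as Markdown,
    skipping <nav>, <footer>, <aside>, <script>, and <style> subtrees."""
    SKIP = {"nav", "footer", "aside", "script", "style"}
    HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}

    def __init__(self):
        super().__init__()
        self.out = []          # collected Markdown blocks
        self.skip_depth = 0    # >0 while inside a skipped subtree
        self.prefix = ""       # pending heading marker

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.prefix = self.HEADINGS[tag] + " "

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.HEADINGS:
            self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.out.append(self.prefix + text)
            self.prefix = ""

html_doc = "<nav>Menu</nav><h1>Title</h1><p>Body text.</p><footer>Ads</footer>"
parser = MarkdownExtractor()
parser.feed(html_doc)
markdown = "\n\n".join(parser.out)
print(markdown)  # "# Title" then "Body text." -- nav and footer are gone
```

A real extractor has to handle far more (nested boilerplate, comment sections, lazy-loaded content), which is exactly why the pipeline leans on Readability instead.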

Python Example

```python
from apify_client import ApifyClient

client = ApifyClient("your-api-token")

# Run the actor and wait for it to finish
run = client.actor("tugelbay/rag-web-browser").call(
    run_input={
        "query": "retrieval augmented generation best practices",
        "maxResults": 3,
        "outputFormat": "markdown",
    }
)

# Each dataset item is one scraped page
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"## {item['title']}")
    print(f"Content: {item['markdown'][:200]}...")
```
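
Once the items come back, a common next step is stuffing them into the model's prompt as grounding context. A minimal sketch; the `items` list here is hypothetical placeholder data standing in for the dataset records above:

```python
# Hypothetical dataset records, mirroring the fields used above
items = [
    {"title": "RAG Best Practices", "url": "https://example.com/rag",
     "markdown": "# RAG Best Practices\nChunk documents before embedding."},
]

def build_context(items, max_chars=4000):
    """Concatenate page Markdown into one prompt context, tagged with source URLs.

    Stops adding pages once the budget would be exceeded, so the prompt
    stays within the model's context window.
    """
    parts, total = [], 0
    for item in items:
        block = f"Source: {item['url']}\n{item['markdown']}"
        if total + len(block) > max_chars:
            break
        parts.append(block)
        total += len(block)
    return "\n\n---\n\n".join(parts)

prompt = f"Answer using only this context:\n\n{build_context(items)}"
```

Keeping the source URL next to each block also lets the LLM cite where an answer came from.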

LangChain Integration

```python
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper(apify_api_token="your-token")

# call_actor returns an ApifyDatasetLoader; the mapping function
# turns each dataset item into a LangChain Document
loader = apify.call_actor(
    actor_id="tugelbay/rag-web-browser",
    run_input={"query": "RAG best practices", "maxResults": 5},
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("markdown", ""),
        metadata={"url": item["url"], "title": item["title"]},
    ),
)
docs = loader.load()

# Feed into your vector store
vector_store.add_documents(docs)
```
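
Whole web pages are usually too long to embed as single vectors, so in practice you split each document into overlapping chunks before adding them to the store. LangChain's `RecursiveCharacterTextSplitter` does this properly (preferring paragraph and sentence boundaries); here is a stdlib-only sketch of the basic idea:

```python
def split_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size chunks that overlap by `overlap` characters,
    so facts straddling a chunk boundary still appear intact in one chunk."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks

chunks = split_text("x" * 2500)
print(len(chunks))             # 3 chunks: 1000 + 1000 + 900 chars
```

Chunk size and overlap are tuning knobs: smaller chunks give more precise retrieval hits, larger ones preserve more surrounding context.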

Why Not Just Use BeautifulSoup?

You could. But:

  1. Google blocks scrapers — you need a SERP proxy
  2. JS-heavy pages — many sites need a browser to render
  3. Content extraction — raw HTML has ads, nav, footers. Readability strips all that
  4. Maintenance — Google changes their HTML, sites change structure. Someone has to maintain the selectors

This actor handles all of that. You just send a query and get Markdown back.

Cost

  • Pay-per-page (PPE pricing) — you pay for each page actually scraped
  • First 100 pages free
  • ~$0.01 per page after that
  • Compare: apify/rag-web-browser charges $20/month flat (rental)
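
The break-even point against the flat rental is easy to work out. A quick sketch, assuming exactly 100 free pages and $0.01 per page thereafter, as listed above:

```python
def monthly_cost(pages, free_pages=100, per_page=0.01):
    """Pay-per-page cost for a month: pages beyond the free tier, at $0.01 each."""
    return max(0, pages - free_pages) * per_page

print(monthly_cost(500))   # 4.0  -- well under the $20 flat rate
print(monthly_cost(2100))  # 20.0 -- break-even with the $20/month rental
```

So pay-per-page wins unless you scrape more than about 2,100 pages a month.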

When To Use This

  • RAG pipelines — feed fresh web content into vector databases
  • AI agents — give your agent web browsing capability via MCP
  • Research automation — search a topic, get structured content
  • Content monitoring — track page changes on a schedule

Links

  • Actor: RAG Web Browser on Apify
  • Docs: Full API reference on the actor page
  • MCP compatible: Works with Claude Desktop, GPTs, LangChain, CrewAI

Questions? Drop a comment — happy to discuss RAG architecture and web scraping strategies.
