DEV Community

Tugelbay Konabayev

How to Add Web Browsing to Your RAG Pipeline

You're building a RAG application. Your LLM needs fresh information from the web. ChatGPT has web browsing built in, but your custom pipeline doesn't.

Options:

  1. Google Custom Search API — $5/1,000 queries, returns URLs only (not content)
  2. SerpAPI — $50/month, still just URLs
  3. Scrape it yourself — build HTTP client, handle JS rendering, parse HTML, extract content...
  4. Use a purpose-built tool — this is what I built

The Solution: RAG Web Browser

I built a web browser specifically for RAG pipelines. It searches Google, scrapes the results, and returns clean Markdown.

How It Works

```
Search Query -> Google SERP -> Top N URLs -> Fetch HTML -> Readability -> Markdown
```
  1. You send a search query ("best python frameworks 2026")
  2. It queries Google via SERP proxy
  3. Fetches the top N result pages
  4. Extracts main content using Mozilla Readability (same as Firefox Reader View)
  5. Converts to Markdown — compact, LLM-friendly, preserves structure
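
Steps 4 and 5 can be illustrated with a toy, standard-library-only sketch. The real actor uses Mozilla Readability, which scores nodes by text density; this hand-rolled parser just shows the core idea of skipping page chrome and emitting Markdown:

```python
from html.parser import HTMLParser

class MarkdownExtractor(HTMLParser):
    """Toy content extractor: emits <h1>-<h3> and body text as Markdown,
    skipping <nav>, <footer>, <aside>, <script>, and <style> subtrees."""
    SKIP = {"nav", "footer", "aside", "script", "style"}
    HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}

    def __init__(self):
        super().__init__()
        self.out = []          # collected Markdown blocks
        self.skip_depth = 0    # >0 while inside a skipped subtree
        self.prefix = ""       # pending heading marker

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.prefix = self.HEADINGS[tag] + " "

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.HEADINGS:
            self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.out.append(self.prefix + text)
            self.prefix = ""

html_doc = "<nav>Menu</nav><h1>Title</h1><p>Body text.</p><footer>Ads</footer>"
parser = MarkdownExtractor()
parser.feed(html_doc)
markdown = "\n\n".join(parser.out)
print(markdown)  # "# Title" then "Body text." -- nav and footer are gone
```

A real extractor has to handle far more (nested boilerplate, comment sections, lazy-loaded content), which is exactly why the pipeline leans on Readability instead.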

Python Example

```python
from apify_client import ApifyClient

client = ApifyClient("your-api-token")

# Run the actor and wait for it to finish
run = client.actor("tugelbay/rag-web-browser").call(
    run_input={
        "query": "retrieval augmented generation best practices",
        "maxResults": 3,
        "outputFormat": "markdown",
    }
)

# Each dataset item is one scraped page
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"## {item['title']}")
    print(f"Content: {item['markdown'][:200]}...")
```
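
Once the items come back, a common next step is stuffing them into the model's prompt as grounding context. A minimal sketch; the `items` list here is hypothetical placeholder data standing in for the dataset records above:

```python
# Hypothetical dataset records, mirroring the fields used above
items = [
    {"title": "RAG Best Practices", "url": "https://example.com/rag",
     "markdown": "# RAG Best Practices\nChunk documents before embedding."},
]

def build_context(items, max_chars=4000):
    """Concatenate page Markdown into one prompt context, tagged with source URLs.

    Stops adding pages once the budget would be exceeded, so the prompt
    stays within the model's context window.
    """
    parts, total = [], 0
    for item in items:
        block = f"Source: {item['url']}\n{item['markdown']}"
        if total + len(block) > max_chars:
            break
        parts.append(block)
        total += len(block)
    return "\n\n---\n\n".join(parts)

prompt = f"Answer using only this context:\n\n{build_context(items)}"
```

Keeping the source URL next to each block also lets the LLM cite where an answer came from.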

LangChain Integration

```python
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper(apify_api_token="your-token")

# call_actor returns an ApifyDatasetLoader; the mapping function
# turns each dataset item into a LangChain Document
loader = apify.call_actor(
    actor_id="tugelbay/rag-web-browser",
    run_input={"query": "RAG best practices", "maxResults": 5},
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("markdown", ""),
        metadata={"url": item["url"], "title": item["title"]},
    ),
)
docs = loader.load()

# Feed into your vector store
vector_store.add_documents(docs)
```
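
Whole web pages are usually too long to embed as single vectors, so in practice you split each document into overlapping chunks before adding them to the store. LangChain's `RecursiveCharacterTextSplitter` does this properly (preferring paragraph and sentence boundaries); here is a stdlib-only sketch of the basic idea:

```python
def split_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size chunks that overlap by `overlap` characters,
    so facts straddling a chunk boundary still appear intact in one chunk."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks

chunks = split_text("x" * 2500)
print(len(chunks))             # 3 chunks: 1000 + 1000 + 900 chars
```

Chunk size and overlap are tuning knobs: smaller chunks give more precise retrieval hits, larger ones preserve more surrounding context.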

Why Not Just Use BeautifulSoup?

You could. But:

  1. Google blocks scrapers — you need a SERP proxy
  2. JS-heavy pages — many sites need a browser to render
  3. Content extraction — raw HTML has ads, nav, footers. Readability strips all that
  4. Maintenance — Google changes their HTML, sites change structure. Someone has to maintain the selectors

This actor handles all of that. You just send a query and get Markdown back.

Cost

  • Pay-per-page (PPE pricing) — you pay for each page actually scraped
  • First 100 pages free
  • ~$0.01 per page after that
  • Compare: apify/rag-web-browser charges $20/month flat (rental)
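
The break-even point against the flat rental is easy to work out. A quick sketch, assuming exactly 100 free pages and $0.01 per page thereafter, as listed above:

```python
def monthly_cost(pages, free_pages=100, per_page=0.01):
    """Pay-per-page cost for a month: pages beyond the free tier, at $0.01 each."""
    return max(0, pages - free_pages) * per_page

print(monthly_cost(500))   # 4.0  -- well under the $20 flat rate
print(monthly_cost(2100))  # 20.0 -- break-even with the $20/month rental
```

So pay-per-page wins unless you scrape more than about 2,100 pages a month.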

When To Use This

  • RAG pipelines — feed fresh web content into vector databases
  • AI agents — give your agent web browsing capability via MCP
  • Research automation — search a topic, get structured content
  • Content monitoring — track page changes on a schedule

Links

  • Actor: RAG Web Browser on Apify
  • Docs: Full API reference on the actor page
  • MCP compatible: Works with Claude Desktop, GPTs, LangChain, CrewAI

Questions? Drop a comment — happy to discuss RAG architecture and web scraping strategies.
