ZyVOP

Posted on Jun 4 • Edited on Jun 8 • Originally published at zyvop.com

How to Scrape Websites for AI Training Data & RAG Pipelines with Python

#ai #rag #retrievalaugmentedgeneration #webscraping

Why Every AI Developer Needs to Know How to Scrape the Web

Here's the uncomfortable truth about building AI applications in 2026: the quality of your output is only as good as the quality of your input data.

Pre-trained models like GPT-4 or Claude have a knowledge cutoff. They don't know about the paper published last Tuesday, the product launched last month, or the price change that happened this morning. If your AI application needs current, domain-specific, or proprietary knowledge, you have exactly two options:

1. Fine-tune the model on your own data (expensive, slow, requires thousands of examples)
2. Use Retrieval-Augmented Generation (RAG): give the model real-time access to scraped, structured knowledge

Option 2 is almost always the right answer. And web scraping is the engine that powers it.

According to Zyte's 2026 Web Scraping Industry Report, AI-powered code generation, LLM-based extraction, and intelligent browser automation are compressing development cycles dramatically — and a growing share of scraping pipelines now feed directly into LLM workflows.

The web scraping software market sits at $1.17 billion in 2026, growing at an 18.5% CAGR — and the AI-powered data extraction segment specifically is projected at $7.48 billion, trending toward $38.44 billion. Web scraping isn't adjacent to the AI wave. It's riding it.

This guide shows you exactly how to build a complete pipeline: scrape web content → clean it → embed it → store it in a vector database → query it with an LLM.

What Is a RAG Pipeline and Why Does It Need Web Data?

Large language models are limited by their training data. They don't know about the documentation update published yesterday, the product released this morning, or the article posted five minutes ago.

Retrieval-Augmented Generation (RAG) solves this by allowing an LLM to retrieve relevant information from an external knowledge base before generating a response. Instead of answering purely from memory, the model answers using information retrieved at query time.

User Question
     ↓
Similarity Search → Vector Database → Retrieve Relevant Chunks
                                              ↓
                              LLM (GPT-4 / Claude / Llama) + Context
                                              ↓
                                    Grounded, Accurate Answer
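To make the flow above concrete, here is a dependency-free toy version of the retrieve-then-prompt loop. It is only a sketch: word counts stand in for a real embedding model, and the names `bag_of_words`, `cosine`, and `toy_retrieve` are made up for illustration.

```python
import math
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Rank chunks by similarity to the question and return the best ones."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:top_k]

knowledge_base = [
    "asyncio.gather runs awaitables concurrently and returns their results",
    "the csv module reads and writes comma separated files",
]
question = "how does asyncio.gather run awaitables"
context = toy_retrieve(question, knowledge_base)

# The retrieved chunk is pasted into the prompt so the model answers from it
prompt = f"Context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```

The rest of this guide swaps each toy piece for a real one: a crawler for the hard-coded list, a sentence-transformers model for the word counts, and a vector database for the sort.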

The quality of a RAG system depends heavily on the quality of the data inside that knowledge base. Documentation sites, research papers, blogs, news articles, and internal company content all need to be collected and kept fresh, which is where web scraping comes in.

The challenge is that raw HTML contains a huge amount of noise: navigation menus, cookie banners, footers, advertisements, and tracking scripts. Modern AI-focused scrapers remove that noise and produce clean, structured content that can be embedded, stored in vector databases, and retrieved efficiently by LLMs.
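As a rough illustration of what "removing the noise" means, here is a minimal standard-library sketch that drops text inside common boilerplate tags. The tag list and the `MainTextExtractor` class are assumptions for this example; real AI-focused scrapers use far more sophisticated heuristics.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an illustrative choice)
NOISE_TAGS = {"nav", "footer", "script", "style", "aside", "header"}

class MainTextExtractor(HTMLParser):
    """Collect visible text, skipping anything nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while we are inside a noise tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

sample = """<html><body>
<nav><a href="/">Home</a> <a href="/docs">Docs</a></nav>
<main><h1>Event loop</h1><p>The event loop runs tasks.</p></main>
<footer>Copyright notice</footer>
<script>track();</script>
</body></html>"""

clean = extract_main_text(sample)
print(clean)
```

Only the heading and paragraph survive; the navigation links, footer, and tracking script are gone before anything reaches the embedding step.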

The AI Scraping Stack in 2026

Before we write code, here are the tools we'll use and why:

Tool	Role	Why It Matters for AI
Crawl4AI	Open-source crawler optimised for RAG	Returns clean Markdown, handles JS, free
Firecrawl	Managed API for LLM-ready content	Zero infra, markdown output, JS rendering
ScrapeGraphAI	LLM-powered extraction via natural language	No CSS selectors, adapts to layout changes
ChromaDB	Local vector database	Stores and queries embeddings
sentence-transformers	Embedding model	Converts text chunks to vectors
LangChain	RAG orchestration	Connects scrapers, embeddings, and LLMs

Part 1: Crawl4AI — The Open-Source RAG Scraper

Crawl4AI is an open-source Python crawler built specifically for RAG pipelines. It generates clean Markdown optimised for RAG with BM25-based content filtering, supports LLM-powered structured extraction, and handles full-site crawling with link following and depth control — with no per-request costs.

This makes it the go-to choice for teams who want full control without paying per-request fees.

Installation

pip install crawl4ai
crawl4ai-setup   # Downloads browser binaries — required first time

Basic usage: scrape a single page to clean Markdown

import asyncio
from crawl4ai import AsyncWebCrawler

async def scrape_to_markdown(url: str) -> str:
    """
    Scrape a URL and return clean, LLM-ready Markdown.
    Crawl4AI strips navigation, footers, ads, and boilerplate automatically.
    """
    async with AsyncWebCrawler(verbose=False) as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=10,     # Skip tiny elements (nav links, buttons)
            remove_overlay_elements=True, # Remove popups, cookie banners
            process_iframes=False,
        )

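    if result.success:
        # Note: newer Crawl4AI releases appear to deprecate `markdown_v2`;
        # if this attribute errors, try `result.markdown.raw_markdown` instead.
        return result.markdown_v2.raw_markdown
    else:
        raise RuntimeError(f"Crawl failed: {result.error_message}")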

# Test it
markdown = asyncio.run(scrape_to_markdown("https://docs.python.org/3/library/asyncio.html"))
print(markdown[:500])
print(f"\nTotal length: {len(markdown)} characters")

The output is clean, stripped Markdown with no HTML tags, no navigation menus, no cookie consent text — exactly what an LLM needs.

Crawling an entire documentation site

import asyncio
from crawl4ai import AsyncWebCrawler
from urllib.parse import urljoin, urlparse
import json

async def crawl_documentation_site(
    start_url: str,
    max_pages: int = 50,
    same_domain_only: bool = True
) -> list[dict]:
    """
    Crawl a documentation site and return a list of
    {'url': ..., 'title': ..., 'content': ...} dicts — ready for RAG ingestion.
    """
    base_domain = urlparse(start_url).netloc
    visited = set()
    queue = [start_url]
    pages = []

    async with AsyncWebCrawler(verbose=False) as crawler:
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in visited:
                continue
            visited.add(url)

            print(f"[{len(pages)+1}/{max_pages}] Crawling: {url}")
            result = await crawler.arun(
                url=url,
                word_count_threshold=15,
                remove_overlay_elements=True,
            )

            if not result.success:
                continue

            pages.append({
                "url": url,
                "title": result.metadata.get("title", ""),
                "content": result.markdown_v2.raw_markdown,
                "scraped_at": result.metadata.get("timestamp", ""),
            })

            # Discover new links: resolve relative hrefs and drop #fragments
            # first, so the same page is not queued under several spellings
            for link in (result.links.get("internal", []) or []):
                absolute = urljoin(url, link.get("href", "")).split("#")[0]
                if not absolute or absolute in visited:
                    continue
                if same_domain_only and urlparse(absolute).netloc != base_domain:
                    continue
                queue.append(absolute)

    print(f"\nCrawled {len(pages)} pages.")
    return pages

pages = asyncio.run(crawl_documentation_site(
    "https://docs.python.org/3/",
    max_pages=30
))

# Save raw crawl output
with open("docs_crawl.json", "w", encoding="utf-8") as f:  # explicit encoding, since ensure_ascii=False writes raw Unicode
    json.dump(pages, f, indent=2, ensure_ascii=False)
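One thing a crawl loop like this tends to hit is the same page showing up under slightly different URLs. Below is a minimal sketch of a canonicalisation helper for the `visited` check. The name `canonical_url` is hypothetical, and stripping the trailing slash assumes the site serves the same content with and without it, which is not always true.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url: str) -> str:
    """Return a normalised key for deduplication (not necessarily for fetching)."""
    parts = urlsplit(url)
    # Assumption: "/docs/" and "/docs" are the same page on the target site
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, keep the query string, drop the #fragment
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

seen = set()
for candidate in [
    "https://docs.python.org/3/library/asyncio.html",
    "https://Docs.Python.org/3/library/asyncio.html#tasks",
    "https://docs.python.org/3/library/asyncio.html/",
]:
    seen.add(canonical_url(candidate))

print(seen)
```

It is safest to use the canonical form only as the key in `visited` and keep fetching the original URL, since relative links on the page resolve against the address that was actually requested.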

Part 2: Cleaning and Chunking for Optimal RAG Performance

Raw page content needs to be split into chunks before embedding. The chunk size is critical: too small and you lose context; too large and retrieval becomes imprecise.

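try:
    # Newer LangChain releases ship the splitters in `langchain-text-splitters`
    from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
except ImportError:
    # Older releases expose them from the main package
    from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter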

def chunk_markdown_page(page: dict) -> list[dict]:
    """
    Split a Markdown page into semantically meaningful chunks.
    Preserve heading context in each chunk for better retrieval.
    """

    # First: split by Markdown headers to preserve semantic sections
    headers_to_split = [
        ("#",  "h1"),
        ("##", "h2"),
        ("###","h3"),
    ]
    header_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split,
        strip_headers=False  # Keep headers in chunk text for context
    )

    # Then: split long sections by character count
    char_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,          # ~200 tokens at roughly 4 characters per token (English text)
        chunk_overlap=80,        # 10% overlap prevents losing context at boundaries
        separators=["\n\n", "\n", ". ", " ", ""],
    )

    chunks = []
    header_docs = header_splitter.split_text(page["content"])

    for doc in header_docs:
        # If section is short enough, keep as one chunk
        if len(doc.page_content) <= 900:
            chunks.append({
                "content": doc.page_content,
                "source_url": page["url"],
                "page_title": page["title"],
                "section": doc.metadata.get("h2") or doc.metadata.get("h1", ""),
                "char_count": len(doc.page_content),
            })
        else:
            # Split large sections further
            sub_chunks = char_splitter.split_text(doc.page_content)
            for i, chunk_text in enumerate(sub_chunks):
                chunks.append({
                    "content": chunk_text,
                    "source_url": page["url"],
                    "page_title": page["title"],
                    "section": doc.metadata.get("h2") or doc.metadata.get("h1", ""),
                    "chunk_index": i,
                    "char_count": len(chunk_text),
                })

    return chunks

# Process all pages
all_chunks = []
for page in pages:
    chunks = chunk_markdown_page(page)
    all_chunks.extend(chunks)

print(f"Total chunks: {len(all_chunks)}")
print(f"Average chunk size: {sum(c['char_count'] for c in all_chunks) / len(all_chunks):.0f} chars")
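To see what `chunk_size` and `chunk_overlap` actually do, here is a stripped-down sketch of fixed-size chunking with overlap in plain Python. It ignores separators and headings, so treat it as an illustration of the idea and not as what the LangChain splitter does internally; `sliding_window_chunks` is a name made up for this example.

```python
def sliding_window_chunks(text: str, chunk_size: int = 800, overlap: int = 80) -> list[str]:
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap   # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                 # the last window already reached the end
    return chunks

demo = "".join(str(i % 10) for i in range(25))
parts = sliding_window_chunks(demo, chunk_size=10, overlap=2)
for part in parts:
    print(part)
```

Each window starts with the last two characters of the previous one. That repeated tail is what keeps a sentence that straddles a boundary retrievable from at least one chunk.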

Part 3: Embedding and Storing in ChromaDB

from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid

# Load embedding model (runs locally, no API key needed)
# "all-MiniLM-L6-v2" is fast and good; use "all-mpnet-base-v2" for higher quality
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Initialise ChromaDB (local persistent storage)
chroma_client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=Settings(anonymized_telemetry=False)
)

collection = chroma_client.get_or_create_collection(
    name="python_docs",
    metadata={"hnsw:space": "cosine"}   # Cosine similarity for text
)

def embed_and_store(chunks: list[dict], batch_size: int = 64):
    """Embed chunks and store in ChromaDB with metadata."""

    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["content"] for c in batch]

        print(f"Embedding batch {i//batch_size + 1}/{(len(chunks)-1)//batch_size + 1}...")
        embeddings = embedder.encode(texts, show_progress_bar=False).tolist()

        collection.add(
            ids=[str(uuid.uuid4()) for _ in batch],
            embeddings=embeddings,
            documents=texts,
            metadatas=[{
                "source_url":  c["source_url"],
                "page_title":  c["page_title"],
                "section":     c.get("section", ""),
            } for c in batch]
        )

    print(f"Stored {len(chunks)} chunks in ChromaDB.")

embed_and_store(all_chunks)
print(f"Collection size: {collection.count()} documents")
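Random `uuid4` IDs mean that running the ingestion twice stores every chunk twice. One possible alternative, sketched here with hypothetical helper names, is to derive the ID from the chunk itself so that repeated runs produce the same IDs.

```python
import hashlib

def stable_chunk_id(source_url: str, content: str) -> str:
    """Deterministic ID: the same URL and text always hash to the same value."""
    digest = hashlib.sha256(f"{source_url}\n{content}".encode("utf-8")).hexdigest()
    return digest[:32]

def assign_stable_ids(chunks: list[dict]) -> dict[str, dict]:
    """Map each stable ID to its chunk; exact duplicates collapse to one entry."""
    return {stable_chunk_id(c["source_url"], c["content"]): c for c in chunks}

sample_chunks = [
    {"source_url": "https://example.com/a", "content": "Event loops run tasks."},
    {"source_url": "https://example.com/a", "content": "Event loops run tasks."},  # duplicate
    {"source_url": "https://example.com/b", "content": "Futures hold results."},
]
by_id = assign_stable_ids(sample_chunks)
print(len(by_id), "unique chunks")
```

With IDs like these, ChromaDB's `collection.upsert(...)` can take the place of `collection.add(...)`, so a re-run overwrites existing entries where it would otherwise append copies.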

Part 4: Querying — The RAG Retrieval Loop

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """
    Embed a query and retrieve the most relevant chunks from ChromaDB.
    """
    query_embedding = embedder.encode([query]).tolist()

    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    retrieved = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        retrieved.append({
            "content":    doc,
            "source_url": meta["source_url"],
            "section":    meta["section"],
            "similarity": round(1 - dist, 3),   # Convert cosine distance to similarity
        })

    return retrieved

# Test retrieval
results = retrieve("How does asyncio event loop work?", top_k=3)
for r in results:
    print(f"\n--- {r['section']} (similarity: {r['similarity']}) ---")
    print(r["content"][:200])
    print(f"Source: {r['source_url']}")
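A vector search always returns `top_k` results, even when nothing in the knowledge base is relevant. A small filter on the `similarity` field can keep weak matches out of the prompt. The `0.3` cut-off below is an arbitrary placeholder, not a recommended value; tune it against your own queries.

```python
def filter_by_similarity(retrieved: list[dict], min_similarity: float = 0.3) -> list[dict]:
    """Keep only chunks whose similarity score clears the threshold."""
    return [r for r in retrieved if r["similarity"] >= min_similarity]

# Hand-written sample in the same shape retrieve() returns
sample_hits = [
    {"content": "asyncio event loop basics", "similarity": 0.71},
    {"content": "csv dialects", "similarity": 0.12},
]
strong_hits = filter_by_similarity(sample_hits)
print([h["content"] for h in strong_hits])
```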

Part 5: The Full RAG Answer Pipeline

Now we connect retrieval to an LLM for grounded, cited answers:

import anthropic   # or: from openai import OpenAI

client = anthropic.Anthropic()   # Uses ANTHROPIC_API_KEY env variable

def rag_answer(question: str, top_k: int = 4) -> dict:
    """
    Full RAG pipeline: retrieve relevant chunks → build prompt → get LLM answer.
    Returns the answer and source citations.
    """
    # Step 1: Retrieve relevant context
    chunks = retrieve(question, top_k=top_k)

    if not chunks:
        return {"answer": "No relevant information found.", "sources": []}

    # Step 2: Build the context block
    context_parts = []
    for i, chunk in enumerate(chunks):
        context_parts.append(
            f"[Source {i+1}: {chunk['section']} — {chunk['source_url']}]\n"
            f"{chunk['content']}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # Step 3: Construct the prompt
    system_prompt = """You are a helpful technical assistant. Answer the user's
question based ONLY on the provided context. If the context doesn't contain
enough information to answer fully, say so clearly. Always cite your sources
using the [Source N] notation from the context."""

    user_prompt = f"""Context:
{context}

Question: {question}

Answer (cite sources):"""

    # Step 4: Call the LLM
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}]
    )

    return {
        "answer":  response.content[0].text,
        "sources": [c["source_url"] for c in chunks],
        "chunks_used": len(chunks),
    }

# Test the full pipeline
result = rag_answer("How do I use asyncio.gather() with error handling?")
print(result["answer"])
print("\nSources:")
for src in result["sources"]:
    print(f"  - {src}")
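If `top_k` grows or chunks get longer, the context block can get expensive. Here is a hedged sketch of a context builder that stops adding sources once a character budget is reached. `build_context` and the `max_chars` default are assumptions for illustration, not part of the pipeline above.

```python
def build_context(chunks: list[dict], max_chars: int = 6000) -> str:
    """Join chunks into a context block, stopping once the budget would be exceeded."""
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        block = f"[Source {i}: {chunk['section']} | {chunk['source_url']}]\n{chunk['content']}"
        # Always keep the first (best-ranked) chunk, even if it alone is over budget
        if parts and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n---\n\n".join(parts)

sample = [
    {"section": "Tasks", "source_url": "https://example.com/a", "content": "a" * 100},
    {"section": "Futures", "source_url": "https://example.com/b", "content": "b" * 100},
]
small = build_context(sample, max_chars=150)
print(small.count("[Source"))
```

Because the retrieved chunks arrive sorted by relevance, cutting from the end drops the weakest evidence first.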

Part 6: Using ScrapeGraphAI for Natural Language Extraction

ScrapeGraphAI handles structured extraction without hand-written selectors: you describe the fields you want in plain English, and an LLM works out how to pull them from the page, with no XPath or CSS selectors to maintain. The project has crossed 15,000 GitHub stars.

pip install scrapegraphai

from scrapegraphai.graphs import SmartScraperGraph

# Define your scraping task in plain English
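graph_config = {
    "llm": {
        "api_key": "YOUR_ANTHROPIC_OR_OPENAI_KEY",
        # Recent ScrapeGraphAI releases expect a "provider/model" string;
        # check the docs for the version you have installed
        "model": "anthropic/claude-sonnet-4-20250514",
    },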
    "verbose": False,
    "headless": True,
}

# Describe what you want — no selectors needed
scraper = SmartScraperGraph(
    prompt="""Extract all research papers listed on this page.
              For each paper return: title, authors (as a list),
              publication date, abstract (first 2 sentences), and URL.""",
    source="https://arxiv.org/list/cs.AI/recent",
    config=graph_config,
)

result = scraper.run()

# Returns structured JSON matching your description
import json
print(json.dumps(result, indent=2))

The key advantage: when the website redesigns and changes its CSS classes, your scraper is far more likely to keep working, because it targets the content and not the markup structure.
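Since an LLM does the extraction, individual records can come back incomplete or with the wrong types, so it can be worth validating them before they go anywhere near your index. The sketch below assumes you have already pulled a list of paper dicts out of `result`. The exact shape of ScrapeGraphAI's output depends on your prompt, and the field names here simply mirror the prompt above.

```python
# Fields we insist on, with the type each one must have (illustrative choice)
REQUIRED_FIELDS = {"title": str, "authors": list, "url": str}

def validate_papers(records: list) -> tuple[list[dict], list]:
    """Split records into (valid, rejected) based on required fields and types."""
    valid, rejected = [], []
    for record in records:
        ok = isinstance(record, dict) and all(
            isinstance(record.get(field), expected) and bool(record.get(field))
            for field, expected in REQUIRED_FIELDS.items()
        )
        (valid if ok else rejected).append(record)
    return valid, rejected

sample_records = [
    {"title": "A Paper", "authors": ["A. Author"], "url": "https://example.com/1"},
    {"title": "No Authors", "authors": [], "url": "https://example.com/2"},
    {"title": "Wrong Type", "authors": "A. Author", "url": "https://example.com/3"},
]
good, bad = validate_papers(sample_records)
print(f"{len(good)} valid, {len(bad)} rejected")
```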

Part 7: Keeping Your RAG Knowledge Base Fresh

A static scrape gets stale. Here's a lightweight refresh scheduler:

import asyncio
import json
from datetime import datetime, timezone, timedelta
import chromadb

REFRESH_INTERVAL_HOURS = 24
SOURCES = [
    "https://docs.python.org/3/",
    "https://playwright.dev/python/docs/intro",
    "https://docs.scrapy.org/en/latest/",
]

async def refresh_knowledge_base():
    """Re-scrape all sources and update ChromaDB with new content."""
    print(f"Starting knowledge base refresh at {datetime.now(timezone.utc).isoformat()}")

    # Count pages that are new vs. already indexed (and re-indexed on this run)
    new_count = updated_count = 0

    for source_url in SOURCES:
        print(f"Re-crawling: {source_url}")
        pages = await crawl_documentation_site(source_url, max_pages=20)

        for page in pages:
            # Check whether this URL already has chunks in the collection
            existing = collection.get(
                where={"source_url": page["url"]},
                limit=1
            )
            if existing["ids"]:
                # Delete old chunks for this URL before re-adding
                collection.delete(where={"source_url": page["url"]})
                updated_count += 1
            else:
                new_count += 1

            chunks = chunk_markdown_page(page)
            embed_and_store(chunks)

    print(f"Refresh complete — {new_count} new pages, {updated_count} updated.")
    return {"new": new_count, "updated": updated_count}

# Schedule with asyncio
async def scheduled_refresh():
    while True:
        await refresh_knowledge_base()
        print(f"Next refresh in {REFRESH_INTERVAL_HOURS} hours.")
        await asyncio.sleep(REFRESH_INTERVAL_HOURS * 3600)

# asyncio.run(scheduled_refresh())   # Uncomment to run continuously
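The refresh loop above re-embeds every page on every run, even when nothing changed. One way to cut that cost, sketched here under the assumption that you persist a URL-to-hash mapping somewhere between runs, is to fingerprint each page's Markdown and only re-index pages whose fingerprint differs. Both function names are hypothetical.

```python
import hashlib

def content_fingerprint(markdown: str) -> str:
    """Hash the page text, ignoring whitespace-only differences."""
    normalised = " ".join(markdown.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def pages_needing_update(pages: list[dict], known: dict[str, str]) -> list[dict]:
    """Return pages that are new or changed, and record their new fingerprints."""
    changed = []
    for page in pages:
        fingerprint = content_fingerprint(page["content"])
        if known.get(page["url"]) != fingerprint:
            changed.append(page)
            known[page["url"]] = fingerprint
    return changed

known_hashes: dict[str, str] = {}   # in practice, load this from disk or a database
crawl = [
    {"url": "https://example.com/a", "content": "# Intro\nHello"},
    {"url": "https://example.com/b", "content": "# Usage\nRun it"},
]
first_run = pages_needing_update(crawl, known_hashes)
second_run = pages_needing_update(crawl, known_hashes)
print(len(first_run), len(second_run))
```

Only the pages returned by `pages_needing_update` would then go through the delete, chunk, and embed steps.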

Choosing the Right Tool for Your AI Scraping Needs

Use case	Best tool	Reason
Scraping docs for RAG, self-hosted	Crawl4AI	Free, open-source, Markdown output
Scraping JS-heavy sites at scale	Firecrawl API	Handles anti-bot, clean Markdown
Extracting structured data (no selectors)	ScrapeGraphAI	Natural language prompts, JSON output
Feeding LangChain / LlamaIndex	Any + LangChain loaders	Direct integration available
Training dataset construction	httpx + Crawl4AI	Volume + low cost

Performance Numbers: What to Expect

Ballpark figures for a typical documentation RAG pipeline (actual numbers depend on hardware, network conditions, and the target site):

Metric	Value
Crawl speed (Crawl4AI, 10 concurrent)	~8 pages/second
Markdown size vs raw HTML	~15% of original (85% reduction)
Embedding time (MiniLM-L6-v2, CPU)	~1,000 chunks/minute
ChromaDB query latency (100k docs)	~15ms
RAG answer latency (retrieval + LLM)	~1.2s average
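Numbers like these vary a lot between machines, so it is worth measuring your own. A minimal timing helper might look like the following; `measure_throughput` is a made-up name, and the workload here is a trivial stand-in for a real step such as embedding a batch.

```python
import time

def measure_throughput(fn, items: list) -> float:
    """Call fn on every item and return items processed per second."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed if elapsed > 0 else float("inf")

# Stand-in workload: splitting short strings into words
rate = measure_throughput(str.split, ["some sample text"] * 10_000)
print(f"{rate:,.0f} items/second")
```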

FAQ

Q: Is it legal to scrape websites for AI training data? The legal landscape is complex and evolving rapidly. Publicly accessible data has often been treated as fair game under the theory of implied licence, but that is far from settled: several lawsuits (particularly against AI companies scraping copyrighted content) are ongoing. Always check robots.txt and the site's Terms of Service, and consult legal counsel for commercial applications.

Q: What's the difference between Crawl4AI and Firecrawl? Crawl4AI is open-source and self-hosted — you run it yourself, no API costs. Firecrawl is a managed API — you pay per request, but it handles infrastructure, JavaScript rendering, and anti-bot bypass automatically. For prototyping, Crawl4AI. For production at scale, Firecrawl.

Q: How many chunks should a RAG database have? For a focused domain (e.g., one product's docs), 5,000–50,000 chunks work well with ChromaDB. For large-scale knowledge bases (hundreds of websites), use a managed vector DB like Pinecone or Weaviate, which are built for much larger collections.

Q: What chunk size is best for RAG? 800–1,000 characters (roughly 200–250 tokens for English text) with 10–15% overlap is a solid starting range for most embedding models. Smaller chunks (under 300 chars) lose context; larger chunks (over 1,500 chars) degrade retrieval precision.

Q: Can I use this approach with local LLMs? Absolutely. Replace the Anthropic API call with Ollama (for local Llama 3 or Mistral). The retrieval pipeline is identical; only the final generation step changes.
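The advice above to check robots.txt is easy to automate with Python's standard library. This sketch parses an inline robots.txt so it runs offline; the user agent string `my-rag-bot` and the helper name are placeholders.

```python
from urllib.robotparser import RobotFileParser

def build_robots_checker(robots_txt: str, user_agent: str = "my-rag-bot"):
    """Return a function that says whether a URL may be fetched under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed

sample_robots = """User-agent: *
Disallow: /private/
"""
allowed = build_robots_checker(sample_robots)
print(allowed("https://example.com/docs/intro"))
print(allowed("https://example.com/private/keys"))
```

Against a live site, you would call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` in place of `parse()`, then skip any URL for which `can_fetch` returns `False`. Passing this check says nothing about the site's Terms of Service, which still need a human read.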

Summary

Step	Tool	What it does
1. Crawl	Crawl4AI / Firecrawl	Scrape pages → clean Markdown
2. Chunk	LangChain TextSplitter	Split into retrieval-optimised segments
3. Embed	sentence-transformers	Convert text to semantic vectors
4. Store	ChromaDB	Index and persist embeddings
5. Retrieve	ChromaDB query	Find most relevant chunks for a question
6. Generate	Claude / GPT-4	Answer grounded in retrieved context
7. Refresh	asyncio scheduler	Keep knowledge base current

Five years ago, web scraping was mostly associated with lead generation, price monitoring, and SEO tooling. In 2026, its role is much larger. Every RAG system, AI research assistant, internal knowledge bot, and real-time LLM application needs a way to collect fresh information. Models provide reasoning. Scrapers provide knowledge.

The developers who understand both will have a significant advantage over those who only understand one side of the stack.
