AlterLab

Posted on • Originally published at alterlab.io

# Web Scraping Pipeline for RAG: Feed Clean Data into Your LLM Without Token Waste

Raw HTML is poison for RAG. A typical news article page is 45,000 characters—roughly 11,000 tokens. The actual article is 800 words, or about 1,100 tokens. You are paying 10× to embed navigation menus, cookie banners, footer links, and inline scripts that actively dilute your embeddings and degrade retrieval quality.
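That 10× figure follows from common rough heuristics for English text, roughly 4 characters per token and roughly 1.33 tokens per word. A quick sanity check (the helper names are illustrative):

```python
# Back-of-envelope token estimates for the numbers above, using the
# ~4 chars/token and ~1.33 tokens/word heuristics for English prose.
def approx_tokens_from_chars(n_chars: int, chars_per_token: float = 4.0) -> int:
    return int(n_chars / chars_per_token)

def approx_tokens_from_words(n_words: int, tokens_per_word: float = 1.33) -> int:
    return int(n_words * tokens_per_word)

page_tokens = approx_tokens_from_chars(45_000)   # the raw HTML page
article_tokens = approx_tokens_from_words(800)   # the article itself
# page_tokens is about 11,250 vs. article_tokens about 1,064: a ~10x overhead
```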

The fix is a five-stage pipeline: reliable fetch → content extraction → normalization → semantic chunking → embed and index. Each stage has a single responsibility. Each failure is isolated and debuggable. This post walks through a production implementation in Python.


## Pipeline Architecture

URL → Stage 1: Fetch → Stage 2: Extract → Stage 3: Normalize → Stage 4: Chunk → Stage 5: Embed + Index


---

## Stage 1: Reliable Fetching

The hardest part of scraping at scale is not parsing—it is getting the HTML. Bot detection blocks requests. JavaScript-rendered SPAs return skeleton HTML to static fetches. IP ranges accumulate blocks.

AlterLab's scraping API handles this in a single POST: rotating residential proxies, automatic CAPTCHA bypass, and optional headless rendering without managing a browser fleet yourself.

**Python:**

```python title="fetch.py" {8-19}
import httpx

ALTERLAB_API_KEY = "YOUR_API_KEY"
ALTERLAB_BASE_URL = "https://api.alterlab.io/v1"

def fetch_page(url: str, render_js: bool = False) -> str:
    """Fetch fully-rendered HTML from any URL."""
    response = httpx.post(
        f"{ALTERLAB_BASE_URL}/scrape",
        headers={"X-API-Key": ALTERLAB_API_KEY, "Content-Type": "application/json"},
        json={
            "url": url,
            "render_js": render_js,
            "wait_for": "networkidle" if render_js else None,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["html"]
```

**cURL:**



```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/docs/getting-started",
    "render_js": true,
    "wait_for": "networkidle"
  }'
```
---

## Stage 2: Content Extraction

`trafilatura` is the most accurate open-source library for pulling article body text from HTML. It outperforms `readability-lxml` and `newspaper3k` on structured documentation and blog content because it uses both DOM heuristics and text-density scoring.

```python title="extract.py" {1-3,14-25}
import json

import trafilatura
from trafilatura.settings import use_config

# Disable per-document timeout—let your own retry logic own the clock
config = use_config()
config.set("DEFAULT", "EXTRACTION_TIMEOUT", "0")

def extract_content(html: str, url: str) -> dict:
    """
    Extract main content from HTML.
    Returns dict with keys: text, title, author, date, description.
    Raises ValueError if no content can be extracted.
    """
    result = trafilatura.extract(
        html,
        url=url,
        include_comments=False,
        include_tables=True,
        no_fallback=False,
        config=config,
        output_format="json",
        with_metadata=True,
    )
    if result is None:
        raise ValueError(f"Extraction returned no content for {url}")
    return json.loads(result)
```

Set `no_fallback=False` to allow trafilatura to fall back to its secondary heuristic if the primary DOM analysis returns nothing—useful for pages with unconventional layouts.

---

## Stage 3: Normalization

After extraction, text still contains artifacts: Unicode non-breaking spaces (`\u00a0`), zero-width joiners, smart quotes, triple-newline runs from CMS templates, and stub lines that are purely punctuation.
```python title="normalize.py" {5-18}
import re
import unicodedata

def normalize_text(text: str) -> str:
    # Canonical Unicode form: convert smart quotes, em-dashes, ligatures
    text = unicodedata.normalize("NFKC", text)
    # Replace invisible/non-breaking whitespace variants
    text = re.sub(r"[\u00a0\u200b\u200c\u200d\ufeff]", " ", text)
    # Collapse horizontal whitespace, preserve single newlines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop lines shorter than 4 chars (nav artifacts: "›", "|", "»")
    lines = [ln for ln in text.split("\n") if len(ln.strip()) > 3 or ln.strip() == ""]
    return "\n".join(lines).strip()
```

This pass runs in microseconds per document and prevents garbage tokens from reaching your embedding model.

---

## Stage 4: Chunking Strategy

Three mistakes that kill retrieval quality:

- **Fixed character splits** break sentences mid-clause. The embedding for a sentence fragment does not represent a complete thought.
- **Whole documents as single vectors** average all content into one point in embedding space. Specific queries retrieve nothing useful.
- **Zero overlap** means a concept bridging two chunks never matches a query that references it as a unit.

Use recursive sentence-aware chunking with configurable overlap:

```python title="chunker.py" {19-48}
from __future__ import annotations

import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    url: str
    chunk_index: int
    total_chunks: int
    metadata: dict = field(default_factory=dict)

def split_sentences(text: str) -> list[str]:
    """Sentence-boundary split on terminal punctuation followed by uppercase."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z\"])", text)

def chunk_document(
    text: str,
    url: str,
    max_tokens: int = 400,
    overlap_sentences: int = 2,
    chars_per_token: float = 4.0,
) -> list[Chunk]:
    """
    Split text into token-bounded chunks with sentence-level overlap.

    Args:
        max_tokens: Approximate token ceiling per chunk.
        overlap_sentences: Sentences carried over to the next chunk.
        chars_per_token: Heuristic for English prose (4.0 is reliable).
    """
    max_chars = int(max_tokens * chars_per_token)
    sentences = split_sentences(text)

    raw_chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for sentence in sentences:
        slen = len(sentence)
        if current_len + slen > max_chars and current:
            raw_chunks.append(" ".join(current))
            current = current[-overlap_sentences:]
            current_len = sum(len(s) for s in current)
        current.append(sentence)
        current_len += slen

    if current:
        raw_chunks.append(" ".join(current))

    total = len(raw_chunks)
    return [
        Chunk(text=t, url=url, chunk_index=i, total_chunks=total)
        for i, t in enumerate(raw_chunks)
    ]
```

**Token ceiling guidelines by model:**
| Model | Recommended max_tokens | Overlap sentences | Notes |
| --- | --- | --- | --- |
| text-embedding-3-small | 400 | 2 | Good default for mixed content |
| text-embedding-3-large | 600 | 2 | Better for long-form technical docs |
| nomic-embed-text | 512 | 3 | Open-source; strong on code + prose |
| BGE-M3 | 800 | 2 | Multilingual; 8192-token context |
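These guidelines translate directly into per-model presets for `chunk_document`. A small sketch (the `CHUNK_PRESETS` name is my own; the values mirror the table above):

```python
# Chunking presets per embedding model, mirroring the guideline table.
CHUNK_PRESETS: dict[str, dict[str, int]] = {
    "text-embedding-3-small": {"max_tokens": 400, "overlap_sentences": 2},
    "text-embedding-3-large": {"max_tokens": 600, "overlap_sentences": 2},
    "nomic-embed-text": {"max_tokens": 512, "overlap_sentences": 3},
    "BGE-M3": {"max_tokens": 800, "overlap_sentences": 2},
}

# Usage with chunk_document from chunker.py:
# chunks = chunk_document(text, url, **CHUNK_PRESETS["text-embedding-3-small"])
```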

---

## Stage 5: Embedding and Indexing

Batch your embedding calls. The OpenAI embeddings API accepts up to 2,048 inputs per request—sending one chunk per call is 100× slower and burns rate limit quota unnecessarily.

```python title="embed.py" {27-48}
import pinecone
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

def init_index(api_key: str, index_name: str) -> pinecone.Index:
    pc = pinecone.Pinecone(api_key=api_key)
    return pc.Index(index_name)

async def embed_texts(texts: list[str]) -> list[list[float]]:
    """Batch embed up to 2048 texts in a single API call."""
    response = await openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        encoding_format="float",
    )
    return [item.embedding for item in response.data]

async def index_chunks(
    chunks: list["Chunk"],
    index: pinecone.Index,
    batch_size: int = 100,
) -> None:
    """Embed and upsert chunks into Pinecone with source metadata preserved."""
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        vectors = await embed_texts([c.text for c in batch])

        upserts = [
            {
                "id": f"{c.url}::{c.chunk_index}",
                "values": vectors[j],
                "metadata": {
                    "url": c.url,
                    "chunk_index": c.chunk_index,
                    "total_chunks": c.total_chunks,
                    "text": c.text,  # store inline—avoids a separate fetch at query time
                },
            }
            for j, c in enumerate(batch)
        ]

        index.upsert(vectors=upserts)
```



Store `text` in the vector metadata. Fetching the source document at query time adds latency and a failure point; paying a few extra bytes per vector is worth it.
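At query time, that inline metadata is enough to assemble citation-ready context without touching the source site. A minimal sketch, assuming the metadata schema written by `embed.py`; `build_context` is a hypothetical helper:

```python
# Hypothetical helper: turn Pinecone query matches into an LLM context
# block, assuming the metadata schema from embed.py (url, chunk_index, text).
def build_context(matches: list[dict]) -> str:
    """Join retrieved chunks with source markers the LLM can cite."""
    parts = []
    for m in matches:
        meta = m["metadata"]
        parts.append(f"[source: {meta['url']} chunk {meta['chunk_index']}]\n{meta['text']}")
    return "\n\n".join(parts)

# Query sketch (index from init_index, query_vector from embed_texts):
# results = index.query(vector=query_vector, top_k=5, include_metadata=True)
# context = build_context(results["matches"])
```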

---

## Full Pipeline



```python title="pipeline.py" {11-55}
import asyncio

from fetch import fetch_page
from extract import extract_content
from normalize import normalize_text
from chunker import Chunk, chunk_document
from embed import init_index, index_chunks

PINECONE_API_KEY = "YOUR_PINECONE_KEY"
PINECONE_INDEX = "rag-knowledge-base"

async def ingest_url(url: str, render_js: bool = False) -> dict:
    """
    End-to-end pipeline: URL → indexed, retrievable chunks.
    Returns a summary dict for logging and monitoring.
    """
    # Stage 1: Fetch
    html = fetch_page(url, render_js=render_js)

    # Stage 2: Extract
    extracted = extract_content(html, url)
    raw_text = extracted.get("text", "")
    title = extracted.get("title", "untitled")

    if not raw_text:
        return {"url": url, "status": "no_content", "chunks": 0}

    # Stage 3: Normalize
    clean_text = normalize_text(raw_text)

    # Stage 4: Chunk
    chunks = chunk_document(
        text=clean_text,
        url=url,
        max_tokens=400,
        overlap_sentences=2,
    )

    # Filter degenerate chunks before embedding
    chunks = [c for c in chunks if len(c.text.split()) >= 15]

    # Stage 5: Embed + index
    index = init_index(PINECONE_API_KEY, PINECONE_INDEX)
    await index_chunks(chunks, index)

    return {
        "url": url,
        "title": title,
        "status": "indexed",
        "chunks": len(chunks),
        "approx_tokens": sum(len(c.text) // 4 for c in chunks),
    }

async def ingest_batch(urls: list[str], concurrency: int = 5) -> list[dict]:
    """Ingest multiple URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> dict:
        async with semaphore:
            try:
                return await ingest_url(url)
            except Exception as e:
                return {"url": url, "status": "error", "error": str(e)}

    return await asyncio.gather(*[bounded(u) for u in urls])

if __name__ == "__main__":
    urls = [
        "https://docs.python.org/3/library/asyncio-task.html",
        "https://platform.openai.com/docs/guides/embeddings",
        "https://www.pinecone.io/docs/upsert-data/",
    ]
    results = asyncio.run(ingest_batch(urls, concurrency=3))
    for r in results:
        print(r)
```

---

## Handling Edge Cases

### Deduplication

The same content appears under multiple URLs: www vs. bare domain, query parameters, pagination suffixes. Hash normalized text before indexing:

```python title="dedup.py" {6-13}
import hashlib

_seen_hashes: set[str] = set()

def is_duplicate(text: str) -> bool:
    """Return True if this exact content has already been indexed this run."""
    digest = hashlib.sha256(text.encode()).hexdigest()[:16]
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False
```

Call `is_duplicate(clean_text)` after Stage 3 and skip to the next URL if it returns `True`.

### Pagination and Crawling

For documentation sites spanning dozens of pages, discover internal links before ingesting. A simple same-domain BFS over `<a href>` tags prevents you from missing chapters or API reference sections. Keep a visited-URL set to avoid cycles.
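A sketch of that same-domain BFS, using only the standard library for link extraction. The `discover_links` name is illustrative, and `fetch` is whatever page fetcher you use (for example, the `fetch_page` helper from Stage 1):

```python
# Same-domain breadth-first link discovery over <a href> tags.
# The regex is deliberately simple; swap in an HTML parser for messy markup.
import re
from collections import deque
from urllib.parse import urljoin, urlparse, urldefrag

HREF_RE = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)

def discover_links(start_url: str, fetch, max_pages: int = 50) -> list[str]:
    """BFS over same-domain links; visited set prevents cycles."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    visited: set[str] = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        for href in HREF_RE.findall(html):
            # Resolve relative links and drop #fragments before comparing
            absolute, _frag = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == domain and absolute not in visited:
                queue.append(absolute)
    return sorted(visited)
```

Feed the returned URLs into `ingest_batch` to cover a whole documentation site in one run.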

### Retries with Backoff

Your embedding API has rate limits even when your scraper does not. Wrap async calls in exponential backoff:



```python title="retry.py" {6-16}
import asyncio
from typing import TypeVar, Callable, Awaitable

T = TypeVar("T")

async def with_retry(fn: Callable[[], Awaitable[T]], attempts: int = 3) -> T:
    for i in range(attempts):
        try:
            return await fn()
        except Exception:
            if i == attempts - 1:
                raise
            await asyncio.sleep(min(2 ** i, 30))
    raise RuntimeError("unreachable")
```

Wrap `index_chunks` calls: `await with_retry(lambda: index_chunks(chunks, index))`.


---

## Production Checklist

Before running this at scale, verify:

- **Freshness TTL:** Set an expiry on indexed documents. Re-scrape on a schedule. Stale RAG context is worse than no context—your LLM will confidently cite outdated information.
- **Minimum chunk length:** Filter out chunks with fewer than 15 words. Stubs from tables or code snippets without context are noise at query time.
- **Metadata completeness:** Always store `scraped_at`, `source_url`, and `section_title` in vector metadata. Your LLM needs these to generate citations users can verify.
- **Extraction failure rate:** Monitor the share of URLs returning `no_content`. Above 5% means your source sites have unusual structure and need custom extraction rules.
- **Concurrency limits:** Do not set concurrency above what your scraping tier supports. Queue excess work with Redis or a task runner rather than hammering with retries.

---

## Takeaway

A five-stage pipeline—fetch, extract, normalize, chunk, embed—is not over-engineering. It is the minimum required to produce input that a retrieval system can actually use.

Token waste is a symptom, not the root problem. The root problem is that HTML is a rendering format, not a content format. Every step in this pipeline exists to close that gap.

The fetching layer is where most teams cut corners and regret it. Flaky HTML from failed bot-bypass attempts or unrendered JS propagates bad data through every downstream stage. Eliminating that variable with a dedicated scraping API means your engineering time goes where it compounds: extraction heuristics, chunking strategy, and retrieval evaluation.
