Web Scraping Pipeline for RAG: Feed Clean Data into Your LLM Without Token Waste
Raw HTML is poison for RAG. A typical news article page is 45,000 characters—roughly 11,000 tokens. The actual article is 800 words, or about 1,100 tokens. You are paying 10× to embed navigation menus, cookie banners, footer links, and inline scripts that actively dilute your embeddings and degrade retrieval quality.
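That 10× figure falls straight out of the common ~4-characters-per-token heuristic (the byte counts here are illustrative, not measurements):

```python
def approx_tokens(char_count: int) -> int:
    """Rough token estimate using the ~4 chars/token rule of thumb for English text."""
    return char_count // 4

page_chars = 45_000      # full HTML of a typical article page
article_chars = 4_400    # ~800 words of actual article content

page_tokens = approx_tokens(page_chars)        # ~11,250
article_tokens = approx_tokens(article_chars)  # ~1,100
waste_ratio = page_tokens / article_tokens     # ~10x
```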
The fix is a five-stage pipeline: reliable fetch → content extraction → normalization → semantic chunking → embed and index. Each stage has a single responsibility. Each failure is isolated and debuggable. This post walks through a production implementation in Python.
## Pipeline Architecture
### Stage 1: Reliable Fetching
The hardest part of scraping at scale is not parsing—it is getting the HTML. Bot detection blocks requests. JavaScript-rendered SPAs return skeleton HTML to static fetches. IP ranges accumulate blocks.
AlterLab's scraping API handles this in a single POST: rotating residential proxies, automatic CAPTCHA bypass, and optional headless rendering without managing a browser fleet yourself.
**Python:**
```python title="fetch.py" {8-19}
import httpx

ALTERLAB_API_KEY = "YOUR_API_KEY"
ALTERLAB_BASE_URL = "https://api.alterlab.io/v1"


def fetch_page(url: str, render_js: bool = False) -> str:
    """Fetch fully-rendered HTML from any URL."""
    response = httpx.post(
        f"{ALTERLAB_BASE_URL}/scrape",
        headers={"X-API-Key": ALTERLAB_API_KEY, "Content-Type": "application/json"},
        json={
            "url": url,
            "render_js": render_js,
            "wait_for": "networkidle" if render_js else None,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["html"]
```
**cURL:**
```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/docs/getting-started",
    "render_js": true,
    "wait_for": "networkidle"
  }'
```

Chunk-size and overlap defaults depend on your embedding model:
| Model | Recommended max_tokens | Overlap Sentences | Notes |
|---|---|---|---|
| text-embedding-3-small | 400 | 2 | Good default for mixed content |
| text-embedding-3-large | 600 | 2 | Better for long-form technical docs |
| nomic-embed-text | 512 | 3 | Open-source; strong on code + prose |
| BGE-M3 | 800 | 2 | Multilingual; 8192-token context |
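The `chunk_document` helper used in the full pipeline below can be sketched roughly like this — a greedy sentence packer with sentence-level overlap. The regex sentence splitter and the chars/4 token estimate are deliberate simplifications; swap in a proper tokenizer for production:

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    url: str
    chunk_index: int
    total_chunks: int
    text: str


def chunk_document(
    text: str, url: str, max_tokens: int = 400, overlap_sentences: int = 2
) -> list[Chunk]:
    """Greedily pack sentences into chunks, carrying trailing sentences as overlap."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks_text: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in sentences:
        sent_tokens = max(1, len(sentence) // 4)  # crude chars/4 token estimate
        if current and current_tokens + sent_tokens > max_tokens:
            chunks_text.append(" ".join(current))
            current = current[-overlap_sentences:]  # overlap into the next chunk
            current_tokens = sum(max(1, len(s) // 4) for s in current)
        current.append(sentence)
        current_tokens += sent_tokens
    if current:
        chunks_text.append(" ".join(current))
    return [
        Chunk(url=url, chunk_index=i, total_chunks=len(chunks_text), text=t)
        for i, t in enumerate(chunks_text)
    ]
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of a little duplicated embedding work.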
### Stage 5: Embedding and Indexing
Batch your embedding calls. The OpenAI embeddings API accepts up to 2,048 inputs per request—sending one chunk per call is 100× slower and burns rate limit quota unnecessarily.
```python title="embed.py" {27-48}
import pinecone
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()


def init_index(api_key: str, index_name: str) -> pinecone.Index:
    pc = pinecone.Pinecone(api_key=api_key)
    return pc.Index(index_name)


async def embed_texts(texts: list[str]) -> list[list[float]]:
    """Batch embed up to 2048 texts in a single API call."""
    response = await openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        encoding_format="float",
    )
    return [item.embedding for item in response.data]


async def index_chunks(
    chunks: list["Chunk"],
    index: pinecone.Index,
    batch_size: int = 100,
) -> None:
    """Embed and upsert chunks into Pinecone with source metadata preserved."""
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        vectors = await embed_texts([c.text for c in batch])
        upserts = [
            {
                "id": f"{c.url}::{c.chunk_index}",
                "values": vectors[j],
                "metadata": {
                    "url": c.url,
                    "chunk_index": c.chunk_index,
                    "total_chunks": c.total_chunks,
                    "text": c.text,  # store inline—avoids a separate fetch at query time
                },
            }
            for j, c in enumerate(batch)
        ]
        index.upsert(vectors=upserts)
```
Store `text` in the vector metadata. Fetching the source document at query time adds latency and a failure point; paying a few extra bytes per vector is worth it.
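At query time, that inline metadata is everything you need to assemble LLM context. A sketch under the same assumptions as above — the embed function is injected so the context-building half stays pure and testable:

```python
from typing import Any, Awaitable, Callable


async def retrieve_context(
    query: str,
    index: Any,  # a pinecone.Index
    embed_fn: Callable[[list[str]], Awaitable[list[list[float]]]],  # e.g. embed_texts
    top_k: int = 5,
) -> str:
    """Embed the query, fetch nearest chunks, and build an LLM-ready context block."""
    [query_vec] = await embed_fn([query])
    result = index.query(vector=query_vec, top_k=top_k, include_metadata=True)
    return build_context([match["metadata"] for match in result["matches"]])


def build_context(metadatas: list[dict]) -> str:
    """Join retrieved chunk texts with their source URLs so the LLM can cite them."""
    blocks = [f"[Source: {m['url']}]\n{m['text']}" for m in metadatas]
    return "\n\n---\n\n".join(blocks)
```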
---
## Full Pipeline
```python title="pipeline.py" {11-55}
import asyncio

from fetch import fetch_page
from extract import extract_content
from normalize import normalize_text
from chunker import Chunk, chunk_document
from embed import init_index, index_chunks

PINECONE_API_KEY = "YOUR_PINECONE_KEY"
PINECONE_INDEX = "rag-knowledge-base"


async def ingest_url(url: str, render_js: bool = False) -> dict:
    """
    End-to-end pipeline: URL → indexed, retrievable chunks.

    Returns a summary dict for logging and monitoring.
    """
    # Stage 1: Fetch
    html = fetch_page(url, render_js=render_js)

    # Stage 2: Extract
    extracted = extract_content(html, url)
    raw_text = extracted.get("text", "")
    title = extracted.get("title", "untitled")
    if not raw_text:
        return {"url": url, "status": "no_content", "chunks": 0}

    # Stage 3: Normalize
    clean_text = normalize_text(raw_text)

    # Stage 4: Chunk
    chunks = chunk_document(
        text=clean_text,
        url=url,
        max_tokens=400,
        overlap_sentences=2,
    )

    # Filter degenerate chunks before embedding
    chunks = [c for c in chunks if len(c.text.split()) >= 15]

    # Stage 5: Embed + index
    index = init_index(PINECONE_API_KEY, PINECONE_INDEX)
    await index_chunks(chunks, index)

    return {
        "url": url,
        "title": title,
        "status": "indexed",
        "chunks": len(chunks),
        "approx_tokens": sum(len(c.text) // 4 for c in chunks),
    }


async def ingest_batch(urls: list[str], concurrency: int = 5) -> list[dict]:
    """Ingest multiple URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> dict:
        async with semaphore:
            try:
                return await ingest_url(url)
            except Exception as e:
                return {"url": url, "status": "error", "error": str(e)}

    return await asyncio.gather(*[bounded(u) for u in urls])


if __name__ == "__main__":
    urls = [
        "https://docs.python.org/3/library/asyncio-task.html",
        "https://platform.openai.com/docs/guides/embeddings",
        "https://www.pinecone.io/docs/upsert-data/",
    ]
    results = asyncio.run(ingest_batch(urls, concurrency=3))
    for r in results:
        print(r)
```
## Handling Edge Cases

### Deduplication
The same content appears under multiple URLs: www vs. bare domain, query parameters, pagination suffixes. Hash normalized text before indexing:
```python title="dedup.py" {6-13}
import hashlib

_seen_hashes: set[str] = set()


def is_duplicate(text: str) -> bool:
    """Return True if this exact content has already been indexed this run."""
    digest = hashlib.sha256(text.encode()).hexdigest()[:16]
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False
Call `is_duplicate(clean_text)` after Stage 3 and skip to the next URL if it returns `True`.
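Exact-text hashing only catches duplicates after you have fetched both copies. Normalizing URLs before fetching catches the trivial variants earlier and saves the scrape entirely. A sketch — the tracking-parameter list is an assumption, so extend it for your sources:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Query parameters that never change page content (assumed list — adjust per source)
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}


def canonicalize_url(url: str) -> str:
    """Normalize a URL so trivially-different variants hash identically."""
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    path = parsed.path.rstrip("/") or "/"
    query = urlencode(
        sorted((k, v) for k, v in parse_qsl(parsed.query) if k not in TRACKING_PARAMS)
    )
    return urlunparse((parsed.scheme.lower(), host, path, "", query, ""))
```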
### Pagination and Crawling
For documentation sites spanning dozens of pages, discover internal links before ingesting. A simple same-domain BFS over `<a href>` tags prevents you from missing chapters or API reference sections. Keep a visited-URL set to avoid cycles.
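That BFS needs nothing beyond the standard library. A minimal sketch — the fetcher is injected as a callable so you can plug in `fetch_page` from above (or a stub in tests):

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_same_domain(
    start_url: str, fetch: Callable[[str], str], max_pages: int = 50
) -> list[str]:
    """BFS over same-domain links; returns discovered URLs in visit order."""
    domain = urlparse(start_url).netloc
    visited: set[str] = set()
    queue: deque[str] = deque([start_url])
    order: list[str] = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        parser = LinkCollector()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]  # drop fragments to avoid cycles
            if urlparse(absolute).netloc == domain and absolute not in visited:
                queue.append(absolute)
    return order
```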
### Retries with Backoff
Your embedding API has rate limits even when your scraper does not. Wrap async calls in exponential backoff:
```python title="retry.py" {6-16}
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retry(fn: Callable[[], Awaitable[T]], attempts: int = 3) -> T:
    for i in range(attempts):
        try:
            return await fn()
        except Exception:
            if i == attempts - 1:
                raise
            await asyncio.sleep(min(2 ** i, 30))
    raise RuntimeError("unreachable")
```
Wrap `index_chunks` calls: `await with_retry(lambda: index_chunks(chunks, index))`.
## Production Checklist
Before running this at scale, verify:
- Freshness TTL: Set an expiry on indexed documents. Re-scrape on a schedule. Stale RAG context is worse than no context—your LLM will confidently cite outdated information.
- Minimum chunk length: Filter out chunks with fewer than 15 words. Stubs from tables or code snippets without context are noise at query time.
- Metadata completeness: Always store `scraped_at`, `source_url`, and `section_title` in vector metadata. Your LLM needs these to generate citations users can verify.
- Extraction failure rate: Monitor the share of URLs returning `no_content`. Above 5% means your source sites have unusual structure and need custom extraction rules.
- Concurrency limits: Do not set `concurrency` above what your scraping tier supports. Queue excess work with Redis or a task runner rather than hammering with retries.
## Takeaway
A five-stage pipeline—fetch, extract, normalize, chunk, embed—is not over-engineering. It is the minimum required to produce input that a retrieval system can actually use.
Token waste is a symptom, not the root problem. The root problem is that HTML is a rendering format, not a content format. Every step in this pipeline exists to close that gap.
The fetching layer is where most teams cut corners and regret it. Flaky HTML from failed bot-bypass attempts or unrendered JS propagates bad data through every downstream stage. Eliminating that variable with a dedicated scraping API means your engineering time goes where it compounds: extraction heuristics, chunking strategy, and retrieval evaluation.