# How to Feed Clean Web Data to RAG Pipelines Without Wasting 90% of Your LLM Tokens
Raw HTML is the worst possible input for a RAG pipeline. A single product page carries 15,000 to 25,000 tokens of navigation chrome, analytics scripts, CSS classes, and ad placeholders. Your embedding model processes all of it. Your vector store stores all of it. Your retrieval step searches through all of it.
You pay for every token.
The fix is straightforward: extract only the content that matters before it reaches your embedding model. Strip the noise. Keep the signal. Structure it so retrieval actually works.
Here is how to build that pipeline.
## The Token Math Behind Dirty Web Data
A typical e-commerce product page breaks down like this:
- Product title, description, specs: ~800 tokens
- Navigation menus, footer, sidebar: ~3,000 tokens
- JavaScript bundles, tracking pixels, ad scripts: ~8,000 tokens
- CSS class names, inline styles, layout divs: ~4,000 tokens
- Schema markup, meta tags, Open Graph: ~1,200 tokens
Your RAG pipeline cares about the first line. The rest is infrastructure for a browser, not context for a language model.
When you embed raw HTML, the noise drowns out the signal. Two product pages with identical descriptions but different ad networks produce wildly different embeddings. Retrieval quality drops. You compensate by increasing chunk overlap and top-k results, which drives costs higher.
Extract clean content first. Embed only what matters.
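The arithmetic above is worth sanity-checking. A minimal sketch using the per-page estimates from the breakdown (the numbers are the article's estimates, not measurements):

```python
# Rough signal-to-noise check using the per-page token estimates above.
signal = 800                            # title, description, specs
noise = 3_000 + 8_000 + 4_000 + 1_200   # nav, scripts, styles, meta markup
total = signal + noise

print(f"signal share: {signal / total:.1%}")   # ~4.7% of tokens carry meaning
print(f"tokens wasted per page: {noise}")
```

Less than five percent of the page is retrievable content. The rest is what you pay to embed, store, and search.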
## Step 1: Get Clean Content at the Source
The most efficient place to strip noise is during extraction, not after. Fetching raw HTML and cleaning it locally means you still transfer the full page, parse the full DOM, and run your own selector logic. Doing it server-side through a scraping API cuts the work in half.
Here is the same operation using the Python SDK and a direct cURL call. Both request Markdown output instead of raw HTML.
```python title="scraper.py" {1,4-6}
from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")

response = client.scrape(
    url="https://example.com/product/12345",
    formats=["markdown"]
)

print(response.markdown)
```

```bash title="Terminal" {3-5}
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -d '{
    "url": "https://example.com/product/12345",
    "formats": ["markdown"]
  }'
```
The response arrives as clean Markdown. No HTML tags. No script blocks. Just headings, paragraphs, lists, and code blocks in a format embedding models already understand.
For sites that require JavaScript rendering, set `min_tier=3` to skip the basic HTTP fetcher and go straight to a headless browser. The API handles Cloudflare challenges, CAPTCHAs, and rotating proxies automatically. You get the rendered content without managing browser instances.
## Step 2: Structure Data for Retrieval, Not Display
Markdown output works well for articles, documentation, and blog posts. But product pages, job listings, and pricing tables need structure. A flat text blob loses the relationships between fields.
Use Cortex AI extraction to pull structured data directly from the page. You describe what you want in plain English. The API returns JSON.
```python title="structured_extraction.py" {5-12}
from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")

response = client.scrape(
    url="https://example.com/jobs",
    cortex={
        "prompt": "Extract all job listings. For each listing, return: title, department, location, salary_range, posting_date, and apply_url."
    },
    formats=["json"]
)

for job in response.json["listings"]:
    print(f"{job['title']} - {job['location']} ({job['salary_range']})")
```
The JSON output maps directly to your embedding pipeline. Each job listing becomes a single document with typed fields. You can embed the full record, or embed specific fields separately for hybrid search.
Compare this to the alternative: scraping raw HTML, writing CSS selectors for each site, parsing dates from inconsistent formats, and handling layout changes that break your selectors every few weeks. Cortex handles the variation. You get consistent JSON regardless of how the page renders.
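Embedding fields separately is a small transformation. A sketch of what that might look like, assuming the job-record shape from the extraction example above (the document layout and field choice are illustrative):

```python
# Turn one extracted job record into retrieval documents:
# one full-record document plus per-field documents for hybrid search.
def to_documents(job: dict, url: str) -> list[dict]:
    full_text = " | ".join(f"{k}: {v}" for k, v in job.items())
    docs = [{"id": f"{url}#full", "text": full_text, "field": "all"}]
    for field in ("title", "location", "salary_range"):
        if job.get(field):
            docs.append({"id": f"{url}#{field}", "text": str(job[field]), "field": field})
    return docs

job = {"title": "Data Engineer", "location": "Berlin", "salary_range": "€70k-€90k"}
for doc in to_documents(job, "https://example.com/jobs"):
    print(doc["id"], "→", doc["text"])
```

The `field` metadata lets a hybrid retriever boost exact matches on `title` or filter by `location` without re-parsing text at query time.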
## Step 3: Chunk Strategically
Clean content solves the noise problem. Chunking strategy solves the retrieval problem.
Bad chunking cuts sentences in half. It splits tables across chunks. It separates a heading from the paragraphs it governs. Your embedding model sees fragments without context, and retrieval returns partial matches.
Good chunking respects document structure. Markdown makes this straightforward.
```python title="chunker.py" {6-15}
import re
from typing import List

def chunk_markdown(text: str, max_tokens: int = 500) -> List[str]:
    chunks = []
    sections = re.split(r'\n## ', text)
    for section in sections:
        if not section.strip():
            continue
        if "\n" in section:
            heading, body = section.split("\n", 1)
        else:
            heading, body = section, ""
        current_chunk = f"## {heading}\n" if heading else ""
        for para in body.split("\n\n"):
            # ~4 characters per token keeps the size estimate cheap
            if len(current_chunk) + len(para) > max_tokens * 4:
                chunks.append(current_chunk.strip())
                # start the next chunk with the same heading for context
                current_chunk = f"## {heading}\n" if heading else ""
            current_chunk += para + "\n\n"
        if current_chunk.strip():
            chunks.append(current_chunk.strip())
    return chunks
```
This approach keeps headings attached to their content. It respects paragraph boundaries. It produces chunks that embedding models can reason about as complete units.
The token estimate uses a 4:1 character-to-token ratio for planning. Your embedding provider's tokenizer gives exact counts. Use that for production.
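A minimal sketch of that planning-time estimator; the exact-count variant is hinted in comments and assumes you are embedding with an OpenAI model (swap in whichever tokenizer your provider ships):

```python
# Planning-time estimator: ~4 characters per token for English prose.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

chunk = "## Pricing\nThe starter plan includes 1,000 requests per month."
print(estimate_tokens(chunk))

# For production counts, use your provider's tokenizer instead, e.g.:
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   exact = len(enc.encode(chunk))
```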
## Step 4: Build the Ingestion Pipeline
Tie extraction, cleaning, chunking, and embedding together. The pipeline should handle three scenarios:
- Initial index: Scrape a list of URLs, extract clean content, chunk, embed, store.
- Incremental update: Monitor pages for changes. Re-extract and re-embed only what changed.
- Scheduled refresh: Run on a cron to catch pages that changed without triggering monitoring alerts.
```python title="pipeline.py" {8-10,18-22}
from alterlab import AlterLab
from datetime import datetime

from chunker import chunk_markdown  # structure-aware splitter from Step 3

client = AlterLab(api_key="YOUR_API_KEY")

def ingest_page(url: str, embedding_fn):
    response = client.scrape(
        url=url,
        formats=["markdown"],
        min_tier=3
    )
    if not response.markdown:
        return
    chunks = chunk_markdown(response.markdown)
    for i, chunk in enumerate(chunks):
        vector = embedding_fn(chunk)
        # store_vector is your vector store's upsert (Pinecone, pgvector, etc.)
        store_vector(url, i, chunk, vector, datetime.utcnow())

def ingest_batch(urls: list, embedding_fn):
    for url in urls:
        try:
            ingest_page(url, embedding_fn)
        except Exception as e:
            print(f"Failed {url}: {e}")
```
For incremental updates, use the monitoring feature. Set up watchers on your indexed URLs. When content changes, the API notifies you via webhook. You re-run `ingest_page` for that URL only. No full re-index required.
```python title="monitoring_setup.py" {4-9}
client.monitor(
    url="https://example.com/pricing",
    schedule="0 9 * * 1",  # cron: every Monday at 09:00
    webhook="https://your-server.com/webhooks/alterlab",
    diff=True
)
```
The webhook payload includes a diff showing what changed. You can decide whether the change warrants a re-embedding. A price update does. A typo fix in the footer does not.
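That decision can be automated with a simple filter. A sketch, assuming the diff arrives as a list of changed lines (the payload shape and noise markers here are assumptions, not the documented webhook schema):

```python
# Decide whether a change diff warrants re-embedding.
# Lines touching known low-signal regions are ignored.
NOISE_MARKERS = ("footer", "copyright", "cookie")

def should_reembed(diff_lines: list[str], min_changed_chars: int = 40) -> bool:
    substantive = [
        line for line in diff_lines
        if not any(marker in line.lower() for marker in NOISE_MARKERS)
    ]
    return sum(len(line) for line in substantive) >= min_changed_chars

print(should_reembed(["- Pro plan: $49/mo",
                      "+ Pro plan: $59/mo, annual billing available"]))  # True
print(should_reembed(["+ Copyright 2025 Example Inc."]))                 # False
```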
| Approach | Tokens per Page | Retrieval Quality | Maintenance |
|---|---|---|---|
| Raw HTML | 15,000-25,000 | Low | High (selector breaks) |
| Local HTML cleaning | 5,000-8,000 | Medium | High (DOM changes) |
| Server-side Markdown | 1,500-3,000 | High | Low (handled by API) |
| Cortex JSON extraction | 200-800 | Highest | Lowest (AI adapts) |
## Step 5: Handle Anti-Bot Pages Without Infrastructure
Many sites you want to index block automated requests. Cloudflare challenges, CAPTCHAs, rate limits. Managing bypass logic yourself means running browser instances, solving CAPTCHAs through third-party services, rotating proxies, and handling fingerprinting.
That infrastructure costs more than the scraping itself.
Use tiered scraping to handle this automatically. Start with a lightweight HTTP request. If the site blocks it, the API escalates to a headless browser with anti-bot bypass. You set the floor with `min_tier` to skip the lower tiers for sites you know are protected.
```python title="tiered_scraping.py" {4-7}
response = client.scrape(
    url="https://protected-site.com/data",
    min_tier=3,
    formats=["markdown"]
)

print(response.status)
print(response.markdown[:500])
```
Tier 1 handles simple static pages. Tier 3 adds JavaScript rendering and anti-bot bypass. Tier 5 includes CAPTCHA solving. The API picks the right tier for each URL. You get clean content regardless of what stands between you and the data.
## Cost Breakdown
Token waste compounds across three stages of a RAG pipeline:
**Embedding**: You pay per token sent to the embedding model. Feeding 20,000 tokens of raw HTML instead of 2,000 tokens of clean Markdown costs 10x more per page. Index 10,000 pages and the difference is measurable.
**Storage**: Vector databases charge by dimension count and record volume. Storing embeddings for noise chunks wastes space. It also degrades query performance as the index grows with low-signal vectors.
**Retrieval**: Each query searches the entire index. A bloated index with noisy chunks returns worse results. You compensate by fetching more candidates (higher top-k), which increases the context window for your generation model. That costs more per query.
Clean extraction at the source addresses all three. Smaller chunks. Better embeddings. Faster retrieval. Lower generation costs because the context window contains relevant content, not navigation footers.
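The embedding stage alone is easy to quantify. A back-of-envelope sketch, assuming a $0.02-per-million-token embedding rate (an illustrative figure, not any provider's published price) and the per-page token counts from the comparison table:

```python
# Back-of-envelope embedding cost across an index of 10,000 pages.
PRICE_PER_M = 0.02   # assumed $ per 1M embedding tokens
PAGES = 10_000

def embed_cost(tokens_per_page: int) -> float:
    return tokens_per_page * PAGES / 1_000_000 * PRICE_PER_M

raw_html = embed_cost(20_000)
clean_md = embed_cost(2_000)
print(f"raw HTML: ${raw_html:.2f}, clean Markdown: ${clean_md:.2f}")
```

The absolute dollars are small at this scale, but the 10x ratio carries through to storage volume and retrieval-time context costs, which is where it compounds.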
## When to Use Each Output Format
**Markdown**: Articles, documentation, blog posts, help centers. Any page where the content flows as prose with headings and lists. This is your default for knowledge base ingestion.
**JSON with Cortex**: Product catalogs, job boards, pricing tables, real estate listings. Any page with repeating structured elements. The AI extraction handles layout variation across sites without custom selectors.
**Plain text**: Simple pages with minimal formatting. API response pages. Status pages. Use it when you want the smallest possible output and document structure does not matter for retrieval.
**HTML**: Rarely. Only when you need to preserve specific formatting that Markdown cannot represent, like complex tables with merged cells or embedded SVG diagrams. Most RAG pipelines do not need this.
## Putting It Together
A production RAG ingestion pipeline looks like this:
1. Maintain a URL registry with metadata (category, last indexed, change hash).
2. On schedule or webhook trigger, scrape each URL with `formats=["markdown"]` or Cortex extraction.
3. Chunk the output using structure-aware splitting.
4. Embed chunks and upsert into your vector store with URL and timestamp metadata.
5. Monitor URLs for changes. Re-index only what changed.
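The change hash in step 1 is what keeps re-indexing cheap. A minimal sketch with an in-memory registry (in production you would persist this mapping alongside your URL metadata):

```python
# Skip re-embedding unchanged pages by hashing the extracted Markdown.
import hashlib

registry: dict[str, str] = {}  # url -> sha256 of last-indexed content

def needs_reindex(url: str, markdown: str) -> bool:
    digest = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
    if registry.get(url) == digest:
        return False        # content unchanged, keep existing vectors
    registry[url] = digest  # record the new hash, then re-embed
    return True

print(needs_reindex("https://example.com/pricing", "## Pricing\n$49/mo"))  # True
print(needs_reindex("https://example.com/pricing", "## Pricing\n$49/mo"))  # False
```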
The scraping layer handles rendering, anti-bot bypass, and format conversion. Your pipeline handles chunking, embedding, and storage. Clean separation. Each layer does one job well.
Check the [Python SDK documentation](https://alterlab.io/web-scraping-api-python) for the full API reference, including webhook configuration and scheduling options. The [quickstart guide](https://alterlab.io/docs/quickstart/installation) covers account setup and your first API call.
## Takeaway
Raw HTML wastes tokens on infrastructure code that embedding models cannot use. Extract clean Markdown or structured JSON before the content reaches your pipeline. Chunk with respect to document boundaries. Monitor for changes and re-index incrementally.
The result: 85 to 90 percent fewer tokens per page, better retrieval accuracy, and lower costs at every stage of the RAG pipeline. The scraping API handles rendering and anti-bot bypass. You handle the data.