AlterLab

Posted on • Originally published at alterlab.io

# Web Scraping Pipelines for AI Agents: Cut Token Waste

Build the Pipeline First, Optimize the Prompt Second

Fetch rendered HTML, strip the noise, convert to markdown or typed JSON, then pass clean content to your agent. Done in that order, this pipeline cuts per-page token costs by 10–50x and eliminates hallucinations caused by LLMs trying to parse navigation menus.

The architecture is straightforward. The implementation details—waiting for DOM mutations, isolating content zones, handling JS-heavy SPAs at scale—are where most pipelines break down. This post covers each step with working Python code.


## The Token Math

A typical SaaS pricing page:

  • Raw HTML (scripts, styles, nav, footer included): ~110KB → ~27,000 tokens
  • Article body as clean markdown: ~4KB → ~1,000 tokens

At GPT-4o pricing, scraping 50,000 pages daily costs:

  • Raw HTML pipeline: ~$3,375/day
  • Clean extraction pipeline: ~$125/day

That delta compounds. At 50,000 pages/day, the raw pipeline pushes 1.35 billion tokens through the model daily, and roughly 96% of them are markup noise. That's not a prompt engineering problem—it's a data pipeline problem.
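The arithmetic behind those figures is simple enough to sanity-check. A back-of-envelope sketch, assuming GPT-4o input pricing of $2.50 per million tokens (check current pricing before budgeting):

```python
# Cost model for the figures above.
# Assumption: GPT-4o input pricing of $2.50 per 1M tokens.
PRICE_PER_M_INPUT_USD = 2.50
PAGES_PER_DAY = 50_000

def daily_cost_usd(tokens_per_page: int) -> float:
    """Daily input-token spend for one extraction strategy."""
    total_tokens = PAGES_PER_DAY * tokens_per_page
    return total_tokens / 1_000_000 * PRICE_PER_M_INPUT_USD

raw_html = daily_cost_usd(27_000)   # raw HTML pipeline
clean_md = daily_cost_usd(1_000)    # clean markdown pipeline
print(f"raw: ${raw_html:,.0f}/day, clean: ${clean_md:,.0f}/day")
```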


## Pipeline Architecture

The right extraction strategy depends on your agent's consumption pattern:

  • Research / RAG agents — need document-scoped markdown with source metadata
  • Structured-data agents (price monitors, job scrapers) — need typed JSON with explicit field names
  • Classification / summarization agents — need content-zone markdown, no navigation, no related links

The fetch layer is identical for all three. Extraction diverges at step 3.
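That divergence can be made explicit in code. A minimal dispatch sketch — the `AgentKind` enum and mode names here are illustrative labels for this post's three strategies, not SDK parameters:

```python
from enum import Enum

class AgentKind(Enum):
    RESEARCH = "research"        # research / RAG agents
    STRUCTURED = "structured"    # price monitors, job scrapers
    CLASSIFIER = "classifier"    # classification / summarization

# Step 3 diverges here: same fetch layer, different extraction target.
EXTRACTION_MODE = {
    AgentKind.RESEARCH: "document_markdown_with_metadata",
    AgentKind.STRUCTURED: "typed_json",
    AgentKind.CLASSIFIER: "content_zone_markdown",
}

def extraction_mode(kind: AgentKind) -> str:
    """Pick the extraction strategy for an agent's consumption pattern."""
    return EXTRACTION_MODE[kind]
```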


## Step 1: Fetch Pages That Actually Render

Most high-value pages—pricing tables, product listings, paywalled articles—are JavaScript-rendered and protected by Cloudflare or similar. You need both JS execution and anti-bot bypass in a single request.

```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com/product/12345",
    "render_js": true,
    "wait_for": ".product-price",
    "extract_markdown": true
  }'
```

The `wait_for` selector blocks until the target element is present in the DOM. For React and Vue SPAs where content mounts asynchronously, this is more reliable than any fixed sleep duration.

The Python SDK call is equivalent:



```python title="fetch.py" {7-13}
import alterlab

client = alterlab.Client(api_key="YOUR_API_KEY")

def fetch_page(url: str, wait_for_selector: str | None = None) -> dict:
    """Fetch a JS-rendered page and return structured result."""
    response = client.scrape(
        url=url,
        render_js=True,
        wait_for=wait_for_selector,
        extract_markdown=True,
        timeout=30,
    )

    return {
        "url": url,
        "markdown": response.markdown,
        "title": response.metadata.title,
        "fetched_at": response.metadata.timestamp,
        "status_code": response.status_code,
    }

result = fetch_page(
    "https://shop.example.com/product/12345",
    wait_for_selector=".product-price",
)
print(f"Content: {len(result['markdown'])} chars")
```

The highlighted block is the entire fetch call. Proxy rotation, anti-bot bypass, and browser lifecycle management happen inside the SDK—none of that lives in your pipeline code.


## Step 2: Extract Signal From Noise

Even with `extract_markdown=True`, you'll receive navigation links, sidebar widgets, and related-content carousels mixed with the main body. The second pass isolates the content zone.

Strip noise before running content selectors—order matters. Decomposing footer and nav elements first prevents them from being selected as the "main content" when semantic containers are absent.

```python title="extract.py" {37-60}
import hashlib
import re
from dataclasses import dataclass

import markdownify
from bs4 import BeautifulSoup

@dataclass
class ExtractedPage:
    url: str
    title: str
    body_markdown: str
    word_count: int
    content_hash: str
    fetched_at: str

# Ordered by specificity — stop at first match
CONTENT_SELECTORS = [
    "article",
    "main",
    "[role='main']",
    ".post-content",
    ".article-body",
    ".entry-content",
    "#content",
]

NOISE_SELECTORS = [
    "nav", "header", "footer", "aside",
    ".sidebar", ".related-posts", ".ads", ".cookie-banner",
    "script", "style", "noscript", "[aria-hidden='true']",
]

def extract_content(html: str, url: str, title: str, fetched_at: str) -> ExtractedPage:
    soup = BeautifulSoup(html, "lxml")

    # 1. Remove noise elements first
    for selector in NOISE_SELECTORS:
        for el in soup.select(selector):
            el.decompose()

    # 2. Find main content zone
    content_el = None
    for selector in CONTENT_SELECTORS:
        content_el = soup.select_one(selector)
        if content_el:
            break

    target = content_el or soup.body or soup

    # 3. Convert to markdown, strip media
    raw_md = markdownify.markdownify(
        str(target),
        heading_style="ATX",
        strip=["img", "iframe", "video"],
        newline_style="backslash",
    )

    # Collapse runs of blank lines
    body_md = re.sub(r"\n{3,}", "\n\n", raw_md).strip()

    return ExtractedPage(
        url=url,
        title=title,
        body_markdown=body_md,
        word_count=len(body_md.split()),
        content_hash=hashlib.sha256(body_md.encode()).hexdigest()[:12],
        fetched_at=fetched_at,
    )
```



The highlighted block is the three-pass extraction: decompose noise, select content zone, convert to markdown. The `content_hash` field gives you deduplication and change detection for free.
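The dedup check that hash enables is only a few lines. A sketch using an in-memory map — in production the map would be a Redis key or a database column, and the function names here are illustrative:

```python
import hashlib

def content_hash(body_md: str) -> str:
    """Same 12-hex-char digest as ExtractedPage.content_hash."""
    return hashlib.sha256(body_md.encode()).hexdigest()[:12]

_last_hash: dict[str, str] = {}  # url -> hash from the previous crawl

def changed_since_last_crawl(url: str, body_md: str) -> bool:
    """True if content is new or changed; False means skip the LLM call."""
    h = content_hash(body_md)
    if _last_hash.get(url) == h:
        return False
    _last_hash[url] = h
    return True
```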

<div data-infographic="try-it" data-url="https://alterlab.io/docs" data-description="Try scraping this page with AlterLab — compare the raw HTML token count against the extracted markdown output" />

---

## Step 3: Structured JSON for Data-Oriented Agents

For agents that act on values—not prose—bypass markdown entirely and extract typed fields. A Pydantic model as the extraction target gives you validation, serialization, and a schema your agent can reference in its system prompt.



```python title="structured_extract.py"
import re
from typing import Optional

from bs4 import BeautifulSoup
from pydantic import BaseModel

class ProductListing(BaseModel):
    name: str
    price_usd: Optional[float] = None
    in_stock: bool = True
    rating: Optional[float] = None
    review_count: Optional[int] = None
    url: str

def extract_product(html: str, url: str) -> ProductListing:
    soup = BeautifulSoup(html, "lxml")

    def text(selector: str) -> str:
        el = soup.select_one(selector)
        return el.get_text(strip=True) if el else ""

    # "$1,299.00" → 1299.0
    price_raw = text("[data-price], .price, [itemprop='price'], .product-price")
    price = None
    if m := re.search(r"[\d]+\.?\d*", price_raw.replace(",", "")):
        price = float(m.group())

    # "4.5 out of 5 stars" → 4.5
    rating_raw = text("[data-rating], [itemprop='ratingValue'], .star-rating")
    rating = None
    if m := re.search(r"(\d+\.?\d*)\s*(?:out of|\/|\s*stars?)", rating_raw):
        rating = float(m.group(1))

    stock_text = text(".availability, [data-stock], .stock-status").lower()
    in_stock = "out of stock" not in stock_text

    review_raw = text("[itemprop='reviewCount'], .review-count, .ratings-count")
    review_count = int(m.group()) if (m := re.search(r"\d+", review_raw)) else None

    return ProductListing(
        name=text("h1, [itemprop='name'], .product-title"),
        price_usd=price,
        in_stock=in_stock,
        rating=rating,
        review_count=review_count,
        url=url,
    )
```

Pass the result to your agent as `json.dumps(product.model_dump())`. The token count for a fully populated `ProductListing` is roughly 80–120 tokens versus 25,000+ for the raw HTML page it came from.


## Step 4: Async Batching at Scale

Single-page scraping is a solved problem. Pipelines that process thousands of URLs need concurrency control, retry logic, and graceful degradation.

```python title="pipeline.py" {20-45}
import asyncio
import logging
from typing import AsyncGenerator

import httpx

logger = logging.getLogger(__name__)

API_BASE = "https://api.alterlab.io/v1"
CONCURRENCY = 15
MAX_RETRIES = 3

async def scrape_one(
    client: httpx.AsyncClient,
    url: str,
    api_key: str,
    sem: asyncio.Semaphore,
) -> dict:
    async with sem:
        for attempt in range(MAX_RETRIES):
            try:
                resp = await client.post(
                    f"{API_BASE}/scrape",
                    headers={"X-API-Key": api_key},
                    json={
                        "url": url,
                        "render_js": True,
                        "extract_markdown": True,
                    },
                    timeout=45.0,
                )
                resp.raise_for_status()
                data = resp.json()
                return {"url": url, "markdown": data.get("markdown"), "status": "ok"}
            except httpx.HTTPStatusError as e:
                if e.response.status_code == 429:
                    await asyncio.sleep(2 ** attempt)
                    continue
                logger.warning("HTTP %d for %s", e.response.status_code, url)
                return {"url": url, "error": str(e), "status": "error"}
            except Exception as e:
                if attempt == MAX_RETRIES - 1:
                    return {"url": url, "error": str(e), "status": "error"}
                await asyncio.sleep(1)
        return {"url": url, "error": "max retries exceeded", "status": "error"}

async def scrape_batch(
    urls: list[str],
    api_key: str,
) -> AsyncGenerator[dict, None]:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient() as client:
        tasks = [scrape_one(client, url, api_key, sem) for url in urls]
        for coro in asyncio.as_completed(tasks):
            result = await coro
            if result:
                yield result

# Usage
async def run():
    urls = ["https://example.com/page/1", "https://example.com/page/2"]
    results = []

    async for page in scrape_batch(urls, api_key="YOUR_API_KEY"):
        if page["status"] == "ok":
            results.append(page)
            print(f"✓  {page['url']} ({len(page['markdown'])} chars)")

    print(f"\n{len(results)}/{len(urls)} pages scraped successfully")

asyncio.run(run())
```




The highlighted block is the fetch loop with retry and rate-limit handling. `asyncio.as_completed` lets you process pages as they finish rather than waiting for the slowest URL in each batch.

---

## Extraction Strategy Comparison

<div data-infographic="comparison">
  <table>
    <thead>
      <tr>
        <th>Strategy</th>
        <th>Avg Tokens/Page</th>
        <th>Structured Output</th>
        <th>Best For</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Raw HTML</td>
        <td>~25,000</td>
        <td>No</td>
        <td>Nothing — avoid entirely</td>
      </tr>
      <tr>
        <td>Full-page Markdown</td>
        <td>~3,000</td>
        <td>Partial</td>
        <td>Broad research agents</td>
      </tr>
      <tr>
        <td>Content-zone Markdown</td>
        <td>~800</td>
        <td>Partial</td>
        <td>Summarization, RAG pipelines</td>
      </tr>
      <tr>
        <td>Targeted JSON Extraction</td>
        <td>~120</td>
        <td>Yes</td>
        <td>Price monitors, job scrapers, structured ETL</td>
      </tr>
    </tbody>
  </table>
</div>
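To place your own pages on this table without running a real tokenizer, the roughly 4-characters-per-token ratio implied by the figures earlier (110KB → ~27,000 tokens) gives a serviceable estimate. This is a crude heuristic, not a substitute for your model's actual tokenizer:

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English prose and HTML."""
    return max(1, len(text) // 4)

raw_estimate = rough_token_count("x" * 110_000)  # a ~110KB raw HTML page
md_estimate = rough_token_count("x" * 4_000)     # a ~4KB markdown body
print(f"raw: ~{raw_estimate:,} tokens, markdown: ~{md_estimate:,} tokens")
```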

---

## Feeding Clean Data Into Agent Context

The extraction pipeline is only as useful as the context it produces. Structure agent messages to include provenance so the model can cite sources and you can audit outputs.



```python title="agent_context.py"
from dataclasses import dataclass

@dataclass
class ExtractedPage:
    url: str
    title: str
    body_markdown: str
    fetched_at: str

def build_agent_messages(pages: list[ExtractedPage], query: str) -> list[dict]:
    """
    Construct a message list for the LLM.
    Each source block is clearly delimited with URL and timestamp.
    """
    source_blocks = []
    for page in pages:
        source_blocks.append(
            f"---\n"
            f"Source: {page.url}\n"
            f"Title: {page.title}\n"
            f"Fetched: {page.fetched_at}\n\n"
            f"{page.body_markdown}"
        )

    sources_text = "\n\n".join(source_blocks)

    return [
        {
            "role": "system",
            "content": (
                "You are a research assistant. Use only the provided sources "
                "to answer questions. Cite the Source URL for every factual claim."
            ),
        },
        {
            "role": "user",
            "content": f"Sources:\n\n{sources_text}\n\n---\n\nQuestion: {query}",
        },
    ]
```

The delimiter pattern (`---`, `Source:`, `Fetched:`) gives the model explicit document boundaries. Without them, content from adjacent pages bleeds together in the context window and attribution becomes unreliable.


## Takeaways

  • Never pass raw HTML to an LLM. Convert first. The token savings are 10–50x and the quality improvement is immediate.
  • Strip noise before selecting content. Decompose nav, header, footer, and script elements first—wrong ordering causes false-positive content matches.
  • Match extraction format to agent type. Prose agents need markdown. Structured agents need typed JSON. Don't use one where the other belongs.
  • `wait_for` selector > sleep. Waiting for a specific DOM element is more reliable than any fixed timeout when scraping JS-rendered pages.
  • Semaphore-bounded asyncio is the right concurrency model. Not thread pools, not multiprocessing—async with a concurrency ceiling.
  • Attach provenance at extraction time. URL, timestamp, and content hash belong on every extracted document. Adding them retroactively is harder than it sounds.
  • Content hash enables deduplication and change detection. If the hash matches a previous crawl, skip the LLM call entirely.
