Cut Your AI Agent's Input Tokens by 95% With Smart Data Extraction
You're building an AI agent that needs to read web pages. Research papers. Competitor websites. News articles. Customer websites.
Your agent fetches the URL and sends the raw HTML to your LLM.
Here's what Claude sees:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<title>Article Title</title>
<script src="..."></script>
<link rel="stylesheet" href="...">
<!-- 200 lines of boilerplate -->
</head>
<body>
<nav class="navbar"><!-- 150 lines of navigation markup --></nav>
<div class="header-banner"><!-- 100 lines of header --></div>
<div class="ads"><!-- 300 lines of ad markup --></div>
<main class="content">
<article>
<h1>Article Title</h1>
<p>The actual content your agent cares about...</p>
<p>More content...</p>
</article>
</main>
<aside class="sidebar"><!-- 500 lines of sidebar widgets --></aside>
<footer><!-- 200 lines of footer --></footer>
<script><!-- analytics, tracking, ads --></script>
</body>
</html>
Out of 5,000 tokens, maybe 150 are actual content. The rest is boilerplate, ads, navigation, tracking.
Your agent pays for all 5,000 tokens to read 150 tokens of useful content.
That's a 33x cost multiplier.
The Token Economy Problem
| Source | Total Tokens | Useful Tokens | Waste | Cost/Page (at $3 per 1M input tokens) |
|---|---|---|---|---|
| Raw HTML | 5,000 | 150 | 4,850 (97%) | $0.015 |
| Cleaned Markdown | 200 | 150 | 50 (25%) | $0.0006 |
| Savings | - | - | 96% fewer tokens | 25x cheaper |
At 100 pages/day:
- Raw HTML: $1.50/day = $45/month
- Cleaned Markdown: $0.06/day = $1.80/month
Your agent saves about $43/month by extracting first.
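To check this kind of arithmetic against your own pages, a few lines of Python are enough. This is a minimal sketch, not part of any SDK: the function names are made up for illustration, and the default price assumes the $3 per 1M input tokens this article quotes for Claude. Pass your own model's rate if it differs.

```python
# Assumed input price in dollars per 1M tokens (the Claude figure quoted in this article).
INPUT_PRICE_PER_MTOK = 3.00


def input_cost(tokens: int, price_per_mtok: float = INPUT_PRICE_PER_MTOK) -> float:
    """Dollar cost of sending `tokens` input tokens at the given price."""
    return tokens * price_per_mtok / 1_000_000


def compare(raw_tokens: int, clean_tokens: int, pages_per_day: int,
            days: int = 30, price_per_mtok: float = INPUT_PRICE_PER_MTOK) -> dict:
    """Compare raw-HTML input cost with cleaned-Markdown input cost."""
    raw = input_cost(raw_tokens, price_per_mtok)
    clean = input_cost(clean_tokens, price_per_mtok)
    return {
        "raw_per_page": raw,
        "clean_per_page": clean,
        # Share of input tokens removed by cleaning the page first
        "token_reduction_pct": 100 * (raw_tokens - clean_tokens) / raw_tokens,
        # How many times cheaper the cleaned page is (independent of price)
        "cost_ratio": raw_tokens / clean_tokens,
        "monthly_savings": (raw - clean) * pages_per_day * days,
    }


result = compare(raw_tokens=5_000, clean_tokens=200, pages_per_day=100)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The token reduction and the cost ratio depend only on the two token counts, so they hold whatever your model charges. The dollar figures scale linearly with the price you pass in.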
The Problem with Raw HTML
Raw HTML is mostly noise for LLMs. A typical page carries:
- Navigation menus
- Ads and tracking
- CSS/JavaScript references
- Meta tags and SEO markup
- Sidebar widgets
- Footer links
- Comments sections
- Related articles
None of this is content. All of it costs tokens.
When Claude reads raw HTML, it's reading:
<div class="nav-item"><a href="...">Home</a></div>
<div class="nav-item"><a href="...">About</a></div>
<div class="nav-item"><a href="...">Products</a></div>
...
Instead of:
# Article Title
This is the actual article content...
Same article. 95% fewer tokens.
The Solution: Smart Data Extraction
PageBolt's /extract endpoint reads a URL and returns clean, AI-ready Markdown:
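import Anthropic from '@anthropic-ai/sdk';

const claude = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Ask the /extract endpoint for a cleaned Markdown version of the page
const response = await fetch('https://api.pagebolt.dev/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json' // required for a JSON request body
  },
  body: JSON.stringify({ url: 'https://example.com/article' })
});
if (!response.ok) throw new Error(`Extract failed: ${response.status}`);
const { content } = await response.json();
// Returns: Clean Markdown, no HTML boilerplate

const agentResponse = await claude.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  system: 'You are a research agent. Summarize the provided article.',
  messages: [
    { role: 'user', content: content } // Clean Markdown, not raw HTML
  ]
});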
The endpoint:
- Removes navigation, ads, sidebars, footers
- Extracts only actual content
- Converts to clean Markdown
- Returns structured data ready for AI
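To get a feel for what those steps do, here is a rough local sketch using only Python's standard library. It is not how the /extract endpoint works internally (the article doesn't describe that). The list of skipped tags, the heading-to-Markdown mapping, and the 4-characters-per-token estimate are all assumptions for illustration. Real pages with div-based ads or JavaScript-rendered content need far more than this.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is treated as boilerplate (an assumption; tune per site).
SKIP_TAGS = {"head", "nav", "aside", "footer", "script", "style"}
# Block tags we keep, mapped to the Markdown prefix they get.
BLOCK_PREFIX = {"h1": "# ", "h2": "## ", "p": ""}


class MarkdownExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a boilerplate subtree
        self.current = None   # block tag currently being collected
        self.buffer = []      # text pieces of the current block
        self.blocks = []      # finished Markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_PREFIX:
            self.current, self.buffer = tag, []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.current:
            # Collapse runs of whitespace left over from the HTML layout
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(BLOCK_PREFIX[tag] + text)
            self.current = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.current:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); not a real tokenizer.
    return max(1, len(text) // 4)


sample = """<html><head><title>Article Title</title><script src="x.js"></script></head>
<body><nav><a href="/">Home</a><a href="/about">About</a></nav>
<main><article><h1>Article Title</h1>
<p>The actual content your agent cares about...</p></article></main>
<footer><p>Footer links</p></footer></body></html>"""

markdown = html_to_markdown(sample)
print(markdown)
print(estimate_tokens(sample), "->", estimate_tokens(markdown), "estimated tokens")
```

On the sample page this keeps only the heading and the article paragraph. The navigation, head, and footer never reach the output, which is the effect you want before anything is sent to the model.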
Real-World Agent Scenarios
Scenario 1: Competitive Intelligence Agent
Monitor 20 competitor websites daily, extract key updates, feed to Claude for analysis.
Without extraction (raw HTML):
- 20 sites × 5,000 tokens = 100,000 tokens/day
- Claude input cost at $3 per 1M tokens: $0.30/day = $9/month
- Plus: API calls to Claude × 20 = $5+/day
- Total: $9-159/month
With /extract endpoint:
- 20 sites × 200 tokens (cleaned) = 4,000 tokens/day
- Claude input cost: $0.012/day = $0.36/month
- Extraction: 20 extracts/day at the Starter plan's per-extract rate ($0.006) = $0.12/day = $3.60/month
- Total: about $4/month
Monthly savings: roughly $5-155
Scenario 2: Research Paper Summarization Agent
Your agent reads 100 research papers/month from arXiv, PDFs, HTML pages. Extracts key findings. Summarizes.
Without extraction:
- 100 papers × 4,000 tokens (average) = 400,000 tokens
- Input cost at $3 per 1M tokens: $1.20/month
- Extraction infrastructure: Build it yourself (10+ hours)
With /extract endpoint:
- 100 papers × 300 tokens (cleaned Markdown) = 30,000 tokens
- Input cost: $0.09/month
- Extraction: Fully managed, 2 lines of code
Savings: $1.11/month + 10 hours dev time
Scenario 3: Customer Support Agent
Your support agent reads customer websites to understand their business, then provides tailored advice.
Per customer interaction:
- Raw HTML: 5,000 tokens × $0.01/1K = $0.05 per query
- 100 queries/day = $5/day
- Monthly: $150/month
With /extract:
- Cleaned Markdown: 200 tokens × $0.01/1K = $0.002 per query
- 100 queries/day = $0.20/day
- Monthly: $6/month
Savings: $144/month
The Real Cost of Raw HTML
Raw HTML costs your agent:
- Token waste: 97% of input tokens are boilerplate
- Slower processing: more input tokens mean higher latency
- Lower quality: the LLM has to filter out noise before it can reason about the content
- Higher cost: you pay the per-token input rate for every tag, script reference, and tracking snippet
Claude's pricing:
- Input: $3 per 1M tokens
- Output: $15 per 1M tokens
Every 1,000-token reduction in input saves $0.003 per query, or $3 per 1,000 queries.
If your agent makes 1,000 queries/month:
- Raw HTML waste: 4,850 tokens × 1,000 = 4,850,000 wasted tokens
- Cost of waste: $14.55/month
With 10 agents:
- Total waste cost: $145.50/month
How to Integrate /extract Into Your Agent
LangChain Example
import os

import anthropic
import requests
from langchain.tools import Tool

# Assumed environment variable name for your PageBolt API key
PAGEBOLT_KEY = os.environ['PAGEBOLT_KEY']

def extract_and_summarize(url):
    # Extract clean content
    response = requests.post(
        'https://api.pagebolt.dev/v1/extract',
        json={'url': url},
        headers={'Authorization': f'Bearer {PAGEBOLT_KEY}'},
        timeout=30,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    content = response.json()['content']
    # Feed to Claude (the client reads ANTHROPIC_API_KEY from the environment)
    message = anthropic.Anthropic().messages.create(
        model='claude-3-5-sonnet-20241022',
        max_tokens=1024,
        system='Summarize the provided article.',
        messages=[{'role': 'user', 'content': content}]
    )
    return message.content[0].text

tool = Tool(
    name='extract_and_summarize',
    func=extract_and_summarize,
    description='Extract clean content from URL and summarize'
)
CrewAI Example
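import os

import requests
from crewai.tools import tool  # import path varies by CrewAI version; older releases used crewai_tools

PAGEBOLT_KEY = os.environ['PAGEBOLT_KEY']  # assumed environment variable name

@tool('Extract Page Content')
def extract_tool(url: str) -> str:
    """Extract clean Markdown from any URL."""
    response = requests.post(
        'https://api.pagebolt.dev/v1/extract',
        json={'url': url},
        headers={'Authorization': f'Bearer {PAGEBOLT_KEY}'},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()['content']

# `agent` is an existing crewai Agent defined elsewhere in your project
agent.tools.append(extract_tool)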
Pricing
| Plan | Extractions | Cost | Per Extract |
|---|---|---|---|
| Free | 100/mo | $0 | Free |
| Hobby | 500/mo | $9 | $0.018 |
| Starter | 5,000/mo | $29 | $0.006 |
| Pro | 50,000/mo | $99 | $0.002 |
At 1,000 extractions/month:
- PageBolt: $29/month
- Savings vs raw HTML tokens: $100-200/month
- Net gain: $71-171/month
Getting Started
- Sign up at pagebolt.dev/pricing
- Get API key (60 seconds)
- Add /extract to your agent
- Watch input tokens drop 95%
Your agent now reads clean content, costs 25x less, and makes better decisions.
Start extracting: pagebolt.dev/pricing — 100 extractions free. $9/month for 500.