Extracting Clean Markdown from Any URL: The PageBolt /extract Endpoint
You're building an AI agent. Your agent needs to read a web page and understand it. So you do what everyone does: you pass the raw HTML to your LLM.
The problem: raw HTML is noise. It's full of scripts, ads, analytics, navigation menus, footers, and junk. Your LLM has to wade through 50KB of markup to find 2KB of actual content. You're burning tokens and context.
There's a better way: extract the page as clean Markdown.
The Problem: HTML Noise
When you feed raw HTML to an LLM, you're giving it:
- Scripts and stylesheets (ignored)
- Navigation menus (ignored)
- Ads and tracking pixels (ignored)
- 10KB of boilerplate (wasted tokens)
- 2KB of actual content (what you need)
Your agent pays for all 50KB but can only use 2KB. That's 96% waste.
The Solution: /extract Endpoint
PageBolt's /extract endpoint does one job: it takes a URL, extracts the main content, converts it to clean Markdown, and returns the result.
const response = await fetch('https://api.pagebolt.com/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer YOUR_API_KEY`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/blog/article'
  })
});
const data = await response.json();
console.log(data.markdown);
// Returns: # Article Title
// Article content in clean Markdown...
That's it. One POST request, and the URL becomes Markdown.
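In a real agent you will probably want that call wrapped in a function that fails loudly when something goes wrong. Here is a minimal sketch. The endpoint and the `markdown` field come from the example above; the names `extractMarkdown` and `fetchImpl` are made up for illustration, and the error handling is an assumption, since this post does not describe PageBolt's error responses.

```javascript
// Sketch only: endpoint and field names are copied from the example above.
// How PageBolt reports errors is an assumption, not documented behavior.
const EXTRACT_URL = 'https://api.pagebolt.com/v1/extract';

// fetchImpl is injectable so the function can be tested without network access.
async function extractMarkdown(url, apiKey, fetchImpl = fetch) {
  const response = await fetchImpl(EXTRACT_URL, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url })
  });

  // Stop on non-2xx responses instead of passing an error body to the LLM.
  if (!response.ok) {
    throw new Error(`Extract request failed with HTTP ${response.status}`);
  }

  const data = await response.json();
  if (typeof data.markdown !== 'string') {
    throw new Error('Extract response did not include a markdown string');
  }
  return data.markdown;
}
```

Call it as `await extractMarkdown(pageUrl, process.env.PAGEBOLT_API_KEY)` and catch the error wherever your agent decides whether to retry or skip the page.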
Real Example: AI Agent Reading Research Papers
Let's say you're building an AI agent that summarizes research. You feed it URLs. Your agent needs to extract the content and understand it.
const Anthropic = require('@anthropic-ai/sdk');
// The client reads ANTHROPIC_API_KEY from the environment by default
const client = new Anthropic();

async function summarizeResearchPaper(paperUrl) {
  // Extract the paper as Markdown
  const extractResponse = await fetch('https://api.pagebolt.com/v1/extract', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.PAGEBOLT_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      url: paperUrl,
      format: 'markdown'
    })
  });
  // Fail fast rather than sending an error payload to the model
  if (!extractResponse.ok) {
    throw new Error(`Extraction failed with HTTP ${extractResponse.status}`);
  }
  const { markdown, title, author } = await extractResponse.json();

  // Pass clean Markdown to Claude
  const message = await client.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: `Summarize this research paper in 3 bullet points:
Title: ${title}
Author: ${author}
Content:
${markdown}`
      }
    ]
  });
  return message.content[0].text;
}

// Usage (CommonJS has no top-level await, so consume the returned promise)
summarizeResearchPaper('https://arxiv.org/pdf/2406.12345')
  .then((summary) => console.log(summary))
  .catch((err) => console.error(err));
Why This Matters for AI Agents
When your agent processes web content, token efficiency directly impacts cost and speed:
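Without /extract (raw HTML):
- Input tokens: 50,000 (full HTML)
- Output tokens: 500 (summary)
- Input cost: ~$1.50 per request
With /extract (clean Markdown):
- Input tokens: 2,000 (Markdown only)
- Output tokens: 500 (summary)
- Input cost: ~$0.06 per request
25x cheaper on input, for the same output. These figures are illustrative: they correspond to roughly $30 per million input tokens and leave out the output cost, which is identical in both cases. Plug in your own model's rates.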
Plus, clean Markdown gives the model far less irrelevant markup to sift through than raw HTML, which tends to help accuracy.
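If you want to rerun that comparison with your own numbers, the arithmetic is short. The sketch below assumes a flat price of $30 per million input tokens, which is the rate the example figures work out to and not a published price for any model. `inputCostUsd` is a hypothetical helper name.

```javascript
// Assumption: $30 per million input tokens, inferred from the example
// figures in this post. Replace it with your model's real rate.
const ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 30;

function inputCostUsd(inputTokens, usdPerMillion = ASSUMED_USD_PER_MILLION_INPUT_TOKENS) {
  return (inputTokens / 1_000_000) * usdPerMillion;
}

const rawHtmlCost = inputCostUsd(50_000);   // full HTML
const markdownCost = inputCostUsd(2_000);   // extracted Markdown
const tokenReduction = 50_000 / 2_000;      // how many times fewer input tokens

console.log(rawHtmlCost.toFixed(2));   // 1.50
console.log(markdownCost.toFixed(2));  // 0.06
console.log(tokenReduction);           // 25
```

Output tokens are left out on purpose: the summary is the same length either way, so it adds the same amount to both sides.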
Use Cases
Research aggregator: Extract from 100 research papers, summarize trends
Competitive intelligence: Extract competitor web pages, feed to analysis agent
Documentation agent: Extract API docs from URLs, answer questions about them
News digest: Extract articles, summarize daily news for users
Content curator: Extract blog posts, categorize by topic
Customer support: Extract help docs, train support agent on current docs
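Several of these use cases mean extracting many URLs in one run. Firing 100 requests at once is rarely a good idea, so here is one way to cap how many are in flight. This is a generic sketch: `mapWithConcurrency` is a hypothetical helper, and a sensible limit depends on PageBolt's rate limits, which this post does not state.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results keep the input order; failures are recorded instead of thrown,
// so one bad URL does not abort the whole batch.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until none are left.
  async function runLane() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

For a research aggregator you might call `mapWithConcurrency(paperUrls, 5, extractOne)`, where `extractOne` is whatever function wraps your /extract request, then summarize only the entries with `ok: true`.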
What /extract Returns
{
  "markdown": "# Article Title\n\nArticle content...",
  "title": "Article Title",
  "author": "Author Name",
  "published_date": "2026-03-18",
  "word_count": 1200,
  "estimated_reading_time_minutes": 5
}
Everything you need to give your agent context and to identify the source.
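One way to use those metadata fields is to prepend a short source header to the Markdown before it goes into a prompt. The sketch below assumes any metadata field can be missing or null for some pages, which this post does not confirm either way. `formatSourceHeader` and `buildPromptContext` are hypothetical helper names.

```javascript
// Build a short source header from an /extract result.
// Assumption: any metadata field may be missing or null on some pages.
function formatSourceHeader(result) {
  const lines = [];
  if (result.title) lines.push(`Title: ${result.title}`);
  if (result.author) lines.push(`Author: ${result.author}`);
  if (result.published_date) lines.push(`Published: ${result.published_date}`);
  if (Number.isFinite(result.word_count)) lines.push(`Length: ${result.word_count} words`);
  return lines.join('\n');
}

// Combine the header (if there is one) with the extracted Markdown.
function buildPromptContext(result) {
  const header = formatSourceHeader(result);
  return header ? `${header}\n\n${result.markdown}` : result.markdown;
}
```

Skipping absent fields keeps lines like "Author: undefined" out of the prompt.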
Cost
- Starter: $29/month (500 extractions)
- Growth: $79/month (5,000 extractions)
- Scale: $199/month (50,000 extractions)
For agents processing web content regularly, /extract is the most token-efficient way to feed your LLM real-world data.
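To compare plans, it helps to work out the price per extraction. The snippet below uses the plan prices and quotas listed above and assumes you use the full monthly quota; the effective price is higher if you use less. `perExtractionUsd` is a hypothetical helper name.

```javascript
// Plan prices and quotas as listed above.
const plans = [
  { name: 'Starter', usdPerMonth: 29, extractions: 500 },
  { name: 'Growth', usdPerMonth: 79, extractions: 5000 },
  { name: 'Scale', usdPerMonth: 199, extractions: 50000 }
];

// Effective price per extraction, assuming the whole quota is used.
function perExtractionUsd(plan) {
  return plan.usdPerMonth / plan.extractions;
}

for (const plan of plans) {
  console.log(`${plan.name}: $${perExtractionUsd(plan).toFixed(4)} per extraction`);
}
// Starter: $0.0580 per extraction
// Growth: $0.0158 per extraction
// Scale: $0.0040 per extraction
```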
Getting Started
- Sign up at pagebolt.dev/pricing
- Get your API key
- Make a POST request to /extract with a URL
- Use the Markdown in your agent
Your AI agent now has a direct pipeline from URLs to clean, LLM-friendly content. No HTML parsing. No noise. Just the data your agent needs.
Start free: pagebolt.dev/pricing. 100 extractions/month, no credit card required.