DEV Community

Custodia-Admin

Posted on • Originally published at pagebolt.dev

Extracting Clean Markdown from Any URL: The PageBolt /extract Endpoint

You're building an AI agent. Your agent needs to read a web page and understand it. So you do what everyone does: you pass the raw HTML to your LLM.

The problem: raw HTML is noise. It's full of scripts, ads, analytics, navigation menus, footers, and junk. Your LLM has to parse through 50KB of garbage to find 2KB of actual content. You're burning tokens and context.

There's a better way: extract the page as clean Markdown.

The Problem: HTML Noise

When you feed raw HTML to an LLM, you're giving it:

  • Scripts and stylesheets (ignored)
  • Navigation menus (ignored)
  • Ads and tracking pixels (ignored)
  • 10KB of boilerplate (wasted tokens)
  • 2KB of actual content (what you need)

Your agent pays for all 50KB but can only use 2KB. That's 96% waste.

The Solution: /extract Endpoint

PageBolt's /extract endpoint does one thing: take a URL, extract the main content, convert it to clean Markdown, and return it.

const response = await fetch('https://api.pagebolt.com/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer YOUR_API_KEY`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/blog/article'
  })
});

const data = await response.json();
console.log(data.markdown);
// Returns: # Article Title
//          Article content in clean Markdown...

That's it. One POST request, and the URL becomes Markdown.
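
In practice you will want to check the HTTP status before trusting the body. Here is a minimal sketch, assuming the endpoint and the `markdown` field shown above. The `extractMarkdown` name, the injectable `fetchImpl` parameter, and the error messages are illustrative and not part of PageBolt's API:

```javascript
// Hypothetical helper: wraps the /extract call with basic error handling.
// The endpoint URL and the `markdown` field come from the example above;
// the function name, parameters, and error messages are assumptions.
async function extractMarkdown(url, apiKey, fetchImpl = fetch) {
  const response = await fetchImpl('https://api.pagebolt.com/v1/extract', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url })
  });

  // fetch only rejects on network failure, so check the status explicitly
  if (!response.ok) {
    throw new Error(`Extraction failed: HTTP ${response.status}`);
  }

  const data = await response.json();
  if (typeof data.markdown !== 'string') {
    throw new Error('Response did not include a markdown field');
  }
  return data.markdown;
}
```

Passing `fetchImpl` explicitly makes the helper easy to exercise with a stub in tests, without spending real extractions.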

Real Example: AI Agent Reading Research Papers

Let's say you're building an AI agent that summarizes research. You feed it URLs. Your agent needs to extract the content and understand it.

const Anthropic = require('@anthropic-ai/sdk');

const client = new Anthropic();

async function summarizeResearchPaper(paperUrl) {
  // Extract the paper as Markdown
  const extractResponse = await fetch('https://api.pagebolt.com/v1/extract', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.PAGEBOLT_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      url: paperUrl,
      format: 'markdown'
    })
  });

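  // fetch does not reject on HTTP errors, so fail fast on a bad status
  if (!extractResponse.ok) {
    throw new Error(`Extraction failed: HTTP ${extractResponse.status}`);
  }

  const { markdown, title, author } = await extractResponse.json();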

  // Pass clean Markdown to Claude
  const message = await client.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: `Summarize this research paper in 3 bullet points:

Title: ${title}
Author: ${author}

Content:
${markdown}`
      }
    ]
  });

  return message.content[0].text;
}

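// Usage: CommonJS modules have no top-level await, so wrap the call
async function main() {
  const summary = await summarizeResearchPaper('https://arxiv.org/pdf/2406.12345');
  console.log(summary);
}

main().catch(console.error);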

Why This Matters for AI Agents

When your agent processes web content, token efficiency directly impacts cost and speed:

Without /extract (raw HTML):

  • Input tokens: 50,000 (full HTML)
  • Output tokens: 500 (summary)
  • Cost: ~$1.50 per request

With /extract (clean Markdown):

  • Input tokens: 2,000 (Markdown only)
  • Output tokens: 500 (summary)
  • Cost: ~$0.06 per request

25x cheaper on input tokens, for the same output.

Clean Markdown also strips out markup the model would otherwise have to read past, which tends to help accuracy.
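
The dollar figures above match input-token cost at roughly $30 per million input tokens. That rate is inferred from the article's own numbers, not a quoted price, so substitute your provider's actual pricing. A small sketch of the arithmetic (the function name and the rate are assumptions):

```javascript
// Input-token cost for a single request.
// pricePerMillion is an assumed, illustrative rate; check your model's pricing.
function inputCost(inputTokens, pricePerMillion) {
  return (inputTokens * pricePerMillion) / 1_000_000;
}

const ASSUMED_PRICE_PER_MILLION = 30; // USD, inferred from the figures above

const rawHtmlCost = inputCost(50_000, ASSUMED_PRICE_PER_MILLION);
const markdownCost = inputCost(2_000, ASSUMED_PRICE_PER_MILLION);

console.log(`Raw HTML: $${rawHtmlCost.toFixed(2)}`);   // Raw HTML: $1.50
console.log(`Markdown: $${markdownCost.toFixed(2)}`);  // Markdown: $0.06
console.log(`Input token ratio: ${50_000 / 2_000}x`);  // Input token ratio: 25x
```

The 500 output tokens are the same in both cases, so they add the same amount to each request and do not change which option is cheaper.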

Use Cases

Research aggregator: Extract from 100 research papers, summarize trends

Competitive intelligence: Extract competitor web pages, feed to analysis agent

Documentation agent: Extract API docs from URLs, answer questions about them

News digest: Extract articles, summarize daily news for users

Content curator: Extract blog posts, categorize by topic

Customer support: Extract help docs, train support agent on current docs
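
Several of these use cases involve many URLs at once. One way to handle that is a small worker pool that caps concurrent requests and records failures per URL. This is a sketch only: `extractMany` and its result shape are hypothetical, and `extractFn` stands in for whatever function you use to call /extract.

```javascript
// Hypothetical batch helper: runs extractFn over many URLs with a cap on
// concurrent requests. extractFn is any async function (url) => result.
async function extractMany(urls, extractFn, concurrency = 5) {
  const results = new Array(urls.length);
  let next = 0;

  // Each worker claims the next unprocessed index until the list is exhausted
  async function worker() {
    while (next < urls.length) {
      const i = next++;
      try {
        results[i] = { url: urls[i], ok: true, value: await extractFn(urls[i]) };
      } catch (error) {
        // Record the failure instead of aborting the whole batch
        results[i] = { url: urls[i], ok: false, error: error.message };
      }
    }
  }

  const workerCount = Math.min(Math.max(1, concurrency), urls.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Results come back in the same order as the input URLs, so you can pair each summary with its source even when some extractions fail.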

What /extract Returns

{
  "markdown": "# Article Title\n\nArticle content...",
  "title": "Article Title",
  "author": "Author Name",
  "published_date": "2026-03-18",
  "word_count": 1200,
  "estimated_reading_time_minutes": 5
}

Everything you need to give your agent context and attribute the source.
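
The metadata fields can be folded into the prompt alongside the Markdown. A minimal sketch, assuming the field names in the sample response above; `buildPromptContext` is a hypothetical helper, and it skips any field that is missing, since the article does not say whether every field is always present:

```javascript
// Hypothetical helper: turns an /extract response into prompt text.
// Field names follow the sample response above; missing fields are skipped.
function buildPromptContext(doc) {
  const header = [
    doc.title && `Title: ${doc.title}`,
    doc.author && `Author: ${doc.author}`,
    doc.published_date && `Published: ${doc.published_date}`
  ].filter(Boolean);

  // Separate the metadata header from the body with a blank line
  if (header.length > 0) header.push('');

  return [...header, 'Content:', doc.markdown].join('\n');
}
```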

Cost

  • Starter: $29/month (500 extractions)
  • Growth: $79/month (5,000 extractions)
  • Scale: $199/month (50,000 extractions)

For agents that process web content regularly, /extract is a token-efficient way to feed your LLM real-world data.
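
To compare tiers for a given monthly volume, you can encode the plan table and pick the cheapest one that fits. This sketch uses the prices listed above plus the free tier mentioned at the end of the article (100 extractions/month). The `PLANS` structure and `cheapestPlan` function are illustrative, and actual pricing may change:

```javascript
// Plan limits and prices as listed in this article; they may change.
const PLANS = [
  { name: 'Free', monthlyPrice: 0, extractions: 100 },
  { name: 'Starter', monthlyPrice: 29, extractions: 500 },
  { name: 'Growth', monthlyPrice: 79, extractions: 5000 },
  { name: 'Scale', monthlyPrice: 199, extractions: 50000 }
];

// Returns the cheapest listed plan that covers the monthly volume,
// or null if the volume exceeds every listed plan.
function cheapestPlan(monthlyExtractions) {
  const fits = PLANS.filter((p) => p.extractions >= monthlyExtractions);
  if (fits.length === 0) return null;
  return fits.reduce((best, p) => (p.monthlyPrice < best.monthlyPrice ? p : best));
}

// Effective price per extraction when the full quota is used
for (const plan of PLANS.filter((p) => p.monthlyPrice > 0)) {
  const perExtraction = plan.monthlyPrice / plan.extractions;
  console.log(`${plan.name}: $${perExtraction.toFixed(4)} per extraction`);
}
```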

Getting Started

  1. Sign up at pagebolt.dev/pricing
  2. Get your API key
  3. Make a POST request to /extract with a URL
  4. Use the Markdown in your agent

Your AI agent now has a direct pipeline from URLs to clean, LLM-friendly content. No HTML parsing. No noise. Just the data your agent needs.

Start free: pagebolt.dev/pricing. 100 extractions/month, no credit card required.
