Custodia-Admin

Posted on Mar 19 • Edited on Mar 25 • Originally published at pagebolt.dev

Extracting Clean Markdown from Any URL: The PageBolt /extract Endpoint

#aiagents #markdown #llm #api


You're building an AI agent. Your agent needs to read a web page and understand it. So you do what everyone does: you pass the raw HTML to your LLM.

The problem: raw HTML is noise. It's full of scripts, ads, analytics, navigation menus, footers, and junk. Your LLM has to parse through 50KB of garbage to find 2KB of actual content. You're burning tokens and context.

There's a better way: extract the page as clean Markdown.

The Problem: HTML Noise

When you feed raw HTML to an LLM, you're giving it:

Scripts and stylesheets (ignored)
Navigation menus (ignored)
Ads and tracking pixels (ignored)
10KB of boilerplate (wasted tokens)
2KB of actual content (what you need)

Your agent pays for all 50KB but can only use 2KB. That's 96% waste.
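To see this ratio on a real page, a rough measurement is enough. The sketch below strips scripts, styles, and tags with regular expressions and compares byte counts. The function name `visibleTextRatio` is made up for illustration, and regex stripping is only a heuristic, not a real HTML parser, so treat the result as an estimate.

```javascript
// Rough estimate of how much of an HTML document is visible text.
// Regex stripping is a heuristic for illustration, not a real HTML parser.
function visibleTextRatio(html) {
  const text = html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop script/style bodies
    .replace(/<[^>]+>/g, ' ') // drop the remaining tags
    .replace(/\s+/g, ' ') // collapse whitespace
    .trim();

  const htmlBytes = Buffer.byteLength(html, 'utf8');
  const textBytes = Buffer.byteLength(text, 'utf8');

  return {
    text,
    htmlBytes,
    textBytes,
    ratio: htmlBytes === 0 ? 0 : textBytes / htmlBytes,
  };
}
```

A `ratio` of 0.04 would match the 2KB-out-of-50KB figure above.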

The Solution: /extract Endpoint

PageBolt's /extract endpoint does one job: it takes a URL, extracts the main content, converts it to clean Markdown, and returns it.

const response = await fetch('https://api.pagebolt.com/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer YOUR_API_KEY`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/blog/article'
  })
});

const data = await response.json();
console.log(data.markdown);
// Returns: # Article Title
//          Article content in clean Markdown...

That's it. One POST request, and the URL becomes Markdown.
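In an agent you will usually want failures to surface instead of reading `markdown` off an error body. Here is a minimal sketch of a wrapper, assuming the endpoint URL and Bearer header from the snippet above and assuming that failed requests come back with a non-2xx status. The helper name `extractMarkdown` and the injectable `fetchImpl` parameter are hypothetical, added so the function can be tested without a network call.

```javascript
// Hypothetical helper: wraps the /extract call and surfaces failures.
// Assumes the endpoint and Bearer auth shown in the snippet above.
async function extractMarkdown(url, { apiKey, fetchImpl = fetch } = {}) {
  const response = await fetchImpl('https://api.pagebolt.com/v1/extract', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url }),
  });

  // Assumption: the API signals errors with a non-2xx HTTP status.
  if (!response.ok) {
    throw new Error(`Extract failed for ${url}: HTTP ${response.status}`);
  }

  const data = await response.json();
  if (typeof data.markdown !== 'string') {
    throw new Error(`Extract response for ${url} has no markdown field`);
  }
  return data;
}
```

Call it as `await extractMarkdown('https://example.com/blog/article', { apiKey: process.env.PAGEBOLT_API_KEY })` from inside an async function.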

Real Example: AI Agent Reading Research Papers

Let's say you're building an AI agent that summarizes research. You feed it URLs. Your agent needs to extract the content and understand it.

const Anthropic = require('@anthropic-ai/sdk');

const client = new Anthropic();

async function summarizeResearchPaper(paperUrl) {
  // Extract the paper as Markdown
  const extractResponse = await fetch('https://api.pagebolt.com/v1/extract', {
    method: 'POST',
    headers: {
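      // The earlier snippet sends the key as `Authorization: Bearer ...`;
      // confirm which header your account expects in the PageBolt docs.
      'x-api-key': process.env.PAGEBOLT_API_KEY,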
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      url: paperUrl,
      format: 'markdown'
    })
  });

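  // Stop early if the extraction itself failed
  if (!extractResponse.ok) {
    throw new Error(`Extract failed: HTTP ${extractResponse.status}`);
  }

  const { markdown, title, author } = await extractResponse.json();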

  // Pass clean Markdown to Claude
  const message = await client.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: `Summarize this research paper in 3 bullet points:

Title: ${title}
Author: ${author}

Content:
${markdown}`
      }
    ]
  });

  return message.content[0].text;
}

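// Usage (a promise chain, because CommonJS modules have no top-level await)
summarizeResearchPaper('https://arxiv.org/pdf/2406.12345')
  .then((summary) => console.log(summary))
  .catch((err) => console.error(err));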
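Even clean Markdown can be long, and a full paper may not fit the prompt budget you have in mind. One option is to cap the text before it goes into the prompt. This is a sketch under assumptions: the helper name `truncateMarkdown` and the `[content truncated]` marker are made up, and the limit is counted in characters as a crude stand-in for tokens.

```javascript
// Hypothetical helper: caps Markdown at maxChars, cutting at a paragraph break when possible.
// Characters are only a rough proxy for tokens.
function truncateMarkdown(markdown, maxChars) {
  if (markdown.length <= maxChars) return markdown;

  const slice = markdown.slice(0, maxChars);
  const lastBreak = slice.lastIndexOf('\n\n');

  // Prefer the last paragraph break inside the budget; fall back to a hard cut.
  const cut = lastBreak > 0 ? slice.slice(0, lastBreak) : slice;
  return `${cut}\n\n[content truncated]`;
}
```

In the example above, this would wrap the `markdown` value before it is interpolated into the prompt.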

Why This Matters for AI Agents

When your agent processes web content, token efficiency directly impacts cost and speed:

Without /extract (raw HTML):

Input tokens: 50,000 (full HTML)
Output tokens: 500 (summary)
Cost: ~$1.50 per request

With /extract (clean Markdown):

Input tokens: 2,000 (Markdown only)
Output tokens: 500 (summary)
Cost: ~$0.06 per request

That is 25x fewer input tokens, so the input bill drops by the same factor. Same output.

Clean Markdown is also easier for an LLM to follow than raw HTML, because the content is not buried in markup, so answers tend to be more accurate.
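The dollar figures above can be reproduced with simple arithmetic. They are consistent with an input price of about $30 per million tokens with output tokens left out of the total. That rate is inferred from the numbers here ($1.50 for 50,000 tokens), not taken from any price list, so substitute your model's real rate.

```javascript
// Input price inferred from the figures above ($1.50 / 50,000 tokens); an assumption.
const ASSUMED_INPUT_PRICE_PER_MILLION = 30;

// Cost of the input side of one request, in US dollars.
function inputCostUsd(inputTokens, pricePerMillion = ASSUMED_INPUT_PRICE_PER_MILLION) {
  return (inputTokens / 1_000_000) * pricePerMillion;
}

const rawHtmlCost = inputCostUsd(50_000); // full HTML
const markdownCost = inputCostUsd(2_000); // Markdown only

console.log(rawHtmlCost.toFixed(2)); // 1.50
console.log(markdownCost.toFixed(2)); // 0.06
console.log((rawHtmlCost / markdownCost).toFixed(0)); // 25
```

The 500 output tokens cost the same in both cases, so once output is priced in, the overall ratio will be somewhat below 25x.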

Use Cases

Research aggregator: Extract from 100 research papers, summarize trends

Competitive intelligence: Extract competitor web pages, feed to analysis agent

Documentation agent: Extract API docs from URLs, answer questions about them

News digest: Extract articles, summarize daily news for users

Content curator: Extract blog posts, categorize by topic

Customer support: Extract help docs, train support agent on current docs
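Most of these use cases mean extracting many URLs in one run. A sketch of one way to do that without firing every request at once is below. `mapWithConcurrency` is a made-up helper, and `worker` stands for whatever function performs a single extraction. The right limit depends on PageBolt's rate limits, which are not covered here.

```javascript
// Runs `worker` over `items` with at most `limit` calls in flight.
// Results keep input order and mirror the shape used by Promise.allSettled.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner pulls the next unclaimed index until the list is exhausted.
  async function run() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { status: 'fulfilled', value: await worker(items[index]) };
      } catch (reason) {
        results[index] = { status: 'rejected', reason };
      }
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results;
}
```

One failed URL does not abort the batch: filter on `status === 'fulfilled'` to collect the Markdown, and log the rest for a retry.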

What /extract Returns

{
  "markdown": "# Article Title\n\nArticle content...",
  "title": "Article Title",
  "author": "Author Name",
  "published_date": "2026-03-18",
  "word_count": 1200,
  "estimated_reading_time_minutes": 5
}

That covers both the content to feed your agent and the metadata needed to identify the source.
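One way to use those fields is to put a short source header in front of the Markdown before handing it to the agent. The sketch below assumes the field names from the sample response above and treats every metadata field as optional, since not every page has a detectable author or date. The helper name `toAgentContext` and the header layout are made up for illustration.

```javascript
// Hypothetical helper: prefixes the Markdown with a small source header.
// Field names follow the sample response above; absent fields are skipped.
function toAgentContext(result, sourceUrl) {
  const meta = [
    `Source: ${sourceUrl}`,
    result.title && `Title: ${result.title}`,
    result.author && `Author: ${result.author}`,
    result.published_date && `Published: ${result.published_date}`,
    Number.isFinite(result.word_count) && `Words: ${result.word_count}`,
  ].filter(Boolean);

  return `${meta.join('\n')}\n\n---\n\n${result.markdown}`;
}
```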

Cost

Starter: $29/month (500 extractions)
Growth: $79/month (5,000 extractions)
Scale: $199/month (50,000 extractions)

For agents processing web content regularly, /extract is the most token-efficient way to feed your LLM real-world data.
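To compare the plans, divide the monthly price by the quota. The sketch below uses only the three prices listed above and assumes each quota is a hard monthly cap with no overage billing, which the pricing list does not state. `PLANS`, `costPerExtraction`, and `cheapestPlanFor` are illustrative names.

```javascript
// Plan data copied from the pricing list above.
const PLANS = [
  { name: 'Starter', monthlyUsd: 29, extractions: 500 },
  { name: 'Growth', monthlyUsd: 79, extractions: 5000 },
  { name: 'Scale', monthlyUsd: 199, extractions: 50000 },
];

// Effective price of one extraction when the full quota is used.
function costPerExtraction(plan) {
  return plan.monthlyUsd / plan.extractions;
}

// Cheapest plan whose quota covers the expected monthly volume, or null if none does.
// Assumption: quotas are hard caps with no overage pricing.
function cheapestPlanFor(monthlyExtractions) {
  const fits = PLANS.filter((plan) => plan.extractions >= monthlyExtractions);
  if (fits.length === 0) return null;
  return fits.reduce((best, plan) => (plan.monthlyUsd < best.monthlyUsd ? plan : best));
}
```

At full quota this works out to about 5.8 cents per extraction on Starter, 1.6 cents on Growth, and 0.4 cents on Scale.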

Getting Started

Sign up at pagebolt.dev/pricing
Get your API key
Make a POST request to /extract with a URL
Use the Markdown in your agent

Your AI agent now has a direct pipeline from URLs to clean, LLM-friendly content. No HTML parsing. No noise. Just the data your agent needs.

Start free: pagebolt.dev/pricing. 100 extractions/month, no credit card required.
