DEV Community

Ethan Cole

How to Give Your AI Agent Live Web Access Without Feeding It Raw HTML

Most AI agents eventually run into the same awkward problem: the web is right there, but reading it cleanly is still annoying.

Your agent can plan tasks, write code, summarize text, call tools, and reason through multi-step workflows. Then someone gives it a URL and the nice abstraction gets messy fast.

How should the agent actually read the page?

Raw HTML is noisy. Modern websites render content with JavaScript. Pages have navigation, cookie banners, modals, ads, related posts, footers, and a surprising number of things that are technically text but absolutely not useful context.

If you dump all of that into an LLM, you burn tokens and usually get worse answers.

The setup I keep coming back to is:

  1. Fetch a URL.
  2. Convert the page into clean Markdown or structured JSON.
  3. Pass the cleaned result to your agent as context.

I will use the Thunderbit Web Scraper API for the extraction step. The product is not really the point, though. The point is the boundary: clean the webpage first, then let the agent work with the cleaned input.

Why agents need cleaner web context

An agent might need live webpage context for all kinds of ordinary product work:

  • answer questions about a specific article
  • summarize a competitor page
  • extract pricing from product pages
  • monitor job listings
  • enrich a company database
  • collect sources for a research workflow
  • turn documentation pages into RAG-ready content

The first quick version usually looks like this:

const html = await fetch(url).then((res) => res.text());

It is fine for a demo. It is not much of a foundation.

The HTML might not include the rendered content. The useful text might be buried between scripts, nav links, cookie text, and layout markup. You can clean it yourself, but now your agent project has a side quest: building a web extraction pipeline.

For an agent, the best input is usually not raw HTML. It is either:

  • clean Markdown for reading and reasoning
  • structured JSON for fields the agent needs to act on

For this pattern, I mostly care about two endpoints:

  • Distill: URL to clean Markdown
  • Extract: URL plus schema to JSON or CSV

Use Distill when the agent needs to read a page. Use Extract when your app needs specific fields.

Step 1: Turn a webpage into Markdown

Start with the simplest version: take a URL and turn it into Markdown.

Here is a curl request to the Distill endpoint:

curl -X POST "https://openapi.thunderbit.com/openapi/v1/distill" \
  -H "Authorization: Bearer $THUNDERBIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article"
  }'

The response includes Markdown that is much easier for an LLM to use than raw page HTML.

In Python:

import os
import requests

API_KEY = os.environ["THUNDERBIT_API_KEY"]

response = requests.post(
    "https://openapi.thunderbit.com/openapi/v1/distill",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={"url": "https://example.com/article"},
    timeout=60,
)

response.raise_for_status()
result = response.json()

markdown = result["data"]["markdown"]
print(markdown[:1000])

Now the agent does not need to parse HTML, ignore nav bars, or guess which chunk of the page matters. It gets the cleaned content directly.

Step 2: Give the Markdown to your agent

The prompt can stay plain:

You are helping analyze a webpage.

Use the webpage content below as your source of truth.
If the answer is not supported by the content, say so.

WEBPAGE:
{{markdown}}

USER QUESTION:
{{question}}

That "source of truth" line is doing real work. It keeps the answer grounded in the fetched page instead of letting the model blend page content with whatever it already knows.

In a real app, you might wrap this in a function:

def build_page_context_prompt(markdown: str, question: str) -> str:
    return f"""
You are helping analyze a webpage.

Use the webpage content below as your source of truth.
If the answer is not supported by the content, say so.

WEBPAGE:
{markdown}

USER QUESTION:
{question}
""".strip()

For a lot of small workflows, this gets you surprisingly far:

  • "Summarize this article in five bullets."
  • "What are the pricing tiers on this page?"
  • "Does this documentation mention webhooks?"
  • "Extract the integration steps from this guide."

Step 3: Use structured extraction when the agent needs fields

Markdown is good when the agent needs to understand a page. Sometimes the app needs fields it can act on.

For example:

  • product name and price
  • job title and location
  • company name and description
  • article title, author, and date
  • event name, date, venue, and registration link

When I care about those fields, I reach for schema-based extraction.

Instead of asking the agent to read a big page and then pull fields out of prose, ask the extraction layer to return structured JSON.

curl -X POST "https://openapi.thunderbit.com/openapi/v1/extract" \
  -H "Authorization: Bearer $THUNDERBIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "schema": {
      "type": "object",
      "properties": {
        "name": {
          "type": "string",
          "description": "The product name"
        },
        "price": {
          "type": "string",
          "description": "The current displayed price, including currency"
        },
        "availability": {
          "type": "string",
          "description": "Whether the product is in stock, unavailable, or preorder"
        }
      },
      "required": ["name", "price"]
    }
  }'

Now the next step gets a predictable object, not a wall of text.

For example:

{
  "name": "Example Product",
  "price": "$49.00",
  "availability": "In stock"
}

From there, the object can feed a monitoring workflow, a database update, a Slack notification, or another agent step.
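As a concrete sketch of that handoff, here is what a price-monitoring check over two extracted product objects might look like. The `parse_price` helper is deliberately naive (it assumes a single decimal number appears in the displayed price string), and both function names are illustrative, not part of any API:

```python
import re

def parse_price(price_text):
    """Pull a numeric value out of a displayed price like "$49.00".

    Naive by design: grabs the first decimal number and ignores
    currency symbols entirely.
    """
    match = re.search(r"[\d,]+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"No numeric price in {price_text!r}")
    return float(match.group().replace(",", ""))

def price_changed(previous, current, threshold=0.01):
    """Compare two extracted product objects and report a price change."""
    old = parse_price(previous["price"])
    new = parse_price(current["price"])
    if abs(new - old) < threshold:
        return None
    return {"name": current["name"], "old": old, "new": new}

change = price_changed(
    {"name": "Example Product", "price": "$49.00"},
    {"name": "Example Product", "price": "$44.00"},
)
# change -> {"name": "Example Product", "old": 49.0, "new": 44.0}
```

The point is that this code never sees HTML or Markdown. It works on a predictable object, so it stays small and testable.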

A simple Node.js tool function

If you are building with tool calling, expose web reading as a normal tool.

Here is a minimal Node.js function using built-in fetch:

async function distillUrl(url) {
  const response = await fetch("https://openapi.thunderbit.com/openapi/v1/distill", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.THUNDERBIT_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url }),
  });

  if (!response.ok) {
    throw new Error(`Distill failed: ${response.status} ${await response.text()}`);
  }

  const result = await response.json();
  return result.data.markdown;
}

The agent can call this when the user gives it a URL:

const markdown = await distillUrl("https://example.com/article");

const prompt = `
You are analyzing a webpage.
Answer using only the webpage content below.

WEBPAGE:
${markdown}

QUESTION:
What are the main takeaways?
`;

This code is intentionally boring. The useful part is the boundary: the agent asks for a URL, the tool returns readable Markdown.

A tool function for structured extraction

You can also expose an extraction tool:

async function extractFromUrl(url, schema) {
  const response = await fetch("https://openapi.thunderbit.com/openapi/v1/extract", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.THUNDERBIT_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, schema }),
  });

  if (!response.ok) {
    throw new Error(`Extract failed: ${response.status} ${await response.text()}`);
  }

  return response.json();
}

Then define schemas for the objects your app understands.

For a job listing:

const jobSchema = {
  type: "object",
  properties: {
    title: {
      type: "string",
      description: "The job title",
    },
    company: {
      type: "string",
      description: "The hiring company",
    },
    location: {
      type: "string",
      description: "The job location or remote policy",
    },
    salary: {
      type: "string",
      description: "The listed salary range, if available",
    },
    requirements: {
      type: "array",
      description: "Key candidate requirements",
      items: { type: "string" },
    },
    applyUrl: {
      type: "string",
      description: "The application URL, if visible",
    },
  },
  required: ["title", "company"],
};

For most downstream code, that beats making the agent inspect the whole page every time.

Distill vs Extract

My rule of thumb:

Use Distill when the task is reading-heavy:

  • summarize this page
  • answer questions from this article
  • ingest docs into a knowledge base
  • compare two landing pages
  • create notes from a report

Use Extract when the task is field-heavy:

  • get the product price
  • pull all job listings
  • extract event details
  • convert a directory page into rows
  • enrich a CRM record

In real workflows, you may use both. Distill gives the agent broad context. Extract gives the system reliable fields.
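A both-at-once flow can be a small orchestration function. In this sketch, `distill` and `extract` are injected callables (in practice, wrappers around the two endpoints above — the names are mine, not the API's), which keeps the orchestration independent of any particular HTTP client:

```python
def gather_page_context(url, schema, distill, extract):
    """Fetch both views of a page: Markdown for the agent to read,
    and schema-based JSON for the app to act on.

    Storing the source URL alongside the results makes every
    downstream answer traceable to the page it came from.
    """
    return {
        "source_url": url,
        "markdown": distill(url),
        "fields": extract(url, schema),
    }

# Stubs stand in for real API calls here.
context = gather_page_context(
    "https://example.com/product",
    {"type": "object"},
    distill=lambda url: "# Example Product\n\nA short description.",
    extract=lambda url, schema: {"name": "Example Product", "price": "$49.00"},
)
```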

Why not just let the LLM browse?

If your platform already has built-in browsing, that can be useful for general research. Product features usually need something more controlled:

  • stable API calls
  • predictable output shape
  • server-side API key management
  • batch processing
  • retries and error handling
  • logs for debugging
  • clean content that can be stored or embedded

When this is part of a product, "the model browsed somewhere and said a thing" is hard to debug. A repeatable pipeline is much easier to reason about.

That is why I prefer to separate the pieces:

  1. Web extraction API fetches and cleans the page.
  2. Your app validates and stores the result.
  3. The LLM reasons over the cleaned content.

This gives you logs, retry points, validation points, and fewer mystery failures.

A few practical tips

Keep API keys server-side. Do not put your Thunderbit API key in client-side JavaScript.

Cache page reads when possible. If ten users ask about the same URL, you probably do not need to distill it ten times in five minutes.
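A minimal time-based cache is enough for that case. This sketch assumes an underlying `distill` function like the ones above; the 300-second TTL is an arbitrary default, not a recommendation from any API:

```python
import time

_cache = {}

def distill_cached(url, distill, ttl_seconds=300, now=time.time):
    """Return cached Markdown for a URL if it was distilled recently.

    `distill` is the underlying fetch function; `now` is injectable
    so the expiry logic is easy to test.
    """
    entry = _cache.get(url)
    if entry is not None:
        fetched_at, markdown = entry
        if now() - fetched_at < ttl_seconds:
            return markdown
    markdown = distill(url)
    _cache[url] = (now(), markdown)
    return markdown
```

In a multi-process deployment you would back this with Redis or similar, but the shape stays the same: check, fetch on miss, store with a timestamp.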

Store the source URL with every result. When your agent gives an answer, you want to know which page it used.

Validate structured extraction before taking action. If a field is required for a workflow, check it before sending emails, updating records, or triggering automations.

Use retries for temporary failures. Timeouts, rate limits, and transient server errors should be handled differently from invalid URLs or invalid schemas.
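One way to keep that distinction explicit is a dedicated exception type for retryable failures. Classifying HTTP statuses into `RetryableError` is left to your request wrapper; everything else (invalid URL, invalid schema) fails fast:

```python
import time

class RetryableError(Exception):
    """Timeouts, rate limits, transient 5xx responses."""

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying only failures worth retrying.

    Uses exponential backoff between attempts; any exception other
    than RetryableError is raised immediately.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Usage is just `with_retries(lambda: distill_url(url))`, assuming a distill wrapper that raises `RetryableError` for the transient cases.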

Respect the sites you access. Follow applicable laws, site terms, and robots policies where relevant, and use reasonable request patterns.

Example use cases

A few places where I would use this:

Research assistant

A user gives an article URL. Your app distills it into Markdown, then the agent summarizes it, extracts claims, and suggests follow-up questions.

Sales enrichment

The user enters a company website. Your app extracts company name, positioning, target audience, product categories, and contact links, then your agent drafts a personalized outreach note.

Competitive monitoring

Your app checks competitor pricing pages on a schedule. Extract returns structured pricing data. The agent summarizes changes and highlights anything important.

Documentation helper

Your app distills docs pages into Markdown and stores them in a vector database. The agent answers support questions from up-to-date docs instead of stale model memory.

Job board tracker

Your app extracts jobs from multiple company career pages using one schema. The agent ranks matches for a candidate profile.

Final thoughts

Giving an AI agent live web access sounds bigger than it has to be.

Do not make the agent fight the webpage.

Give it clean Markdown when it needs to read. Give it structured JSON when it needs fields. Keep the messy parts of web extraction behind a tool boundary.

A scraper API is useful here because it gives you that boundary: URL in, Markdown or JSON out. Thunderbit's Web Scraper API does that with Distill for Markdown and Extract for schema-based JSON. It also handles JavaScript-heavy pages and batch workflows, which are the parts I would rather not rebuild for every agent project.

You can get an API key and try it here: Thunderbit Web Scraper API

Start with one tool: read_url(url). Once your agent can reliably read a page, a lot of web-aware workflows become easier to build.
