Scrape raw HTML with AgentQL's REST API

#scraping #data

Sometimes you need to extract data from HTML, but you don't have a URL to pass to AgentQL's REST API. Fret not: our REST API endpoint now supports querying raw HTML directly!

With this functionality, you can scrape data from pages even if you're working behind a firewall, fetching pages with a custom crawler, or integrating with internal tools. Pass the HTML as a string along with your AgentQL query, and AgentQL will return structured data as JSON.

You asked for it: scraping web pages without a URL

You asked if it was possible to scrape data without Playwright. You told us you were already fetching HTML using custom crawlers. We heard you! This new capability is perfect for querying data from:

Private and internal network pages
Previously crawled pages and HTML dumps
Archived HTML files and snapshots

It can even be used to scrape hard-to-reach pages and pages with heavy anti-bot protection. Navigate to the page using a stealth crawler or your own browser, save the page's HTML or copy it as a string, and follow the steps below!

How to extract data from an HTML string

You can pass HTML directly in your API request like so:

curl -L 'https://api.agentql.com/v1/query-data' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: <YOUR-API-KEY>' \
-d '{
  "html": "<!DOCTYPE html><html><body><h1>Main Page</h1></body></html>",
  "query": "{ heading }"
}'

AgentQL will process the HTML and return structured JSON:

{
  "heading": "Main Page"
}
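For readers calling the API from a script rather than the shell, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the endpoint URL and header names are taken from the curl example above, while the function names `build_request` and `query_html` and the 60-second timeout are our own assumptions.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.agentql.com/v1/query-data"


def build_request(html: str, query: str, api_key: str) -> urllib.request.Request:
    """Build the POST request (hypothetical helper, not part of any AgentQL SDK)."""
    # json.dumps escapes quotes, backslashes, and newlines in the HTML string.
    body = json.dumps({"html": html, "query": query}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def query_html(html: str, query: str, api_key: str, timeout: float = 60.0):
    """Send the request and return the parsed JSON response body as-is."""
    request = build_request(html, query, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a valid key, a call such as `query_html(page_html, "{ heading }", api_key)` sends the same payload as the curl command. The function returns whatever JSON the API sends back without assuming its exact shape.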

Got a large, unwieldy chunk of HTML? Or local files you want to send without copy-pasting all the HTML every time? Most HTML will cause JSON formatting errors if you paste it in unescaped anyway. Try this out:

curl -L 'https://api.agentql.com/v1/query-data' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: <YOUR-API-KEY>' \
-d "$(jq -n \
 --arg html "$(cat blog.html)" \
 '{query: "{ heading }", html: $html}'
)"

This reads the file with cat and uses jq to encode the HTML as a valid JSON string (escaping double quotes, newlines, and so on).
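If jq isn't available, the same escaping can be done in Python. The sketch below assumes a hypothetical helper named `payload_from_file`; it reads an HTML file and lets `json.dumps` produce a request body that is safe to send, mirroring what the jq command does.

```python
import json
from pathlib import Path


def payload_from_file(path: str, query: str) -> str:
    """Read an HTML file and return a JSON request body as a string.

    Hypothetical helper for illustration; json.dumps escapes double quotes,
    backslashes, and newlines so the HTML survives as a JSON string.
    """
    html = Path(path).read_text(encoding="utf-8")
    return json.dumps({"query": query, "html": html})
```

The returned string can be used as the request body in place of the jq output, for example by writing it to a file and passing that file to curl with `-d @payload.json`.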

Get started extracting data with HTML and AgentQL

This feature is available now, with no opt-in or special flag required. Learn more in our guide to getting data from HTML with AgentQL or the REST API Reference.

If you have any questions, join our Discord and we will help you out. We love hearing from you! Find us on X or Bluesky, too!

—The TinyFish team building AgentQL
