Rachel-Lee Nabors for AgentQL

Originally published at agentql.com

Scrape raw HTML with AgentQL's REST API

Sometimes you need to extract data from HTML but you don't have a URL to pass to AgentQL's REST API. Fret not: now our REST API endpoint supports querying directly from raw HTML!

With this functionality, you can scrape data from pages even if you're working behind a firewall, fetching pages with a custom crawler, or integrating with internal tools. Pass the HTML as a string along with your AgentQL query, and AgentQL will return structured data as JSON.

You asked for it: scraping web pages without a URL

You asked if it was possible to scrape data without Playwright. You told us you were already fetching HTML using custom crawlers. We heard you! This new capability is perfect for querying data from:

  • Private and internal network pages
  • Previously crawled pages and HTML dumps
  • Archived HTML files and snapshots

It can even be used to scrape difficult-to-reach and heavily anti-botted pages. You can navigate to the page using a stealth crawler or your own browser, save the page's HTML or copy it as a string, and follow the steps below!

How to extract data from an HTML string

You can pass HTML directly in your API request like so:

curl -L 'https://api.agentql.com/v1/query-data' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: <YOUR-API-KEY>' \
-d '{
  "html": "<!DOCTYPE html><html><body><h1>Main Page</h1></body></html>",
  "query": "{ heading }"
}'

AgentQL will process the HTML and return structured JSON:

{
  "heading": "Main Page"
}

Got a large, unwieldy chunk of HTML? Or local files you want to send without copy-pasting all the HTML every time? Most HTML will cause JSON formatting errors if you paste it in raw anyway, because unescaped double quotes and line breaks are not valid inside a JSON string. Try this out:

curl -L 'https://api.agentql.com/v1/query-data' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: <YOUR-API-KEY>' \
-d "$(jq -n \
 --arg html "$(cat blog.html)" \
 '{query: "{ heading }", html: $html}'
)"

This reads the file with cat and uses jq to encode the HTML as a valid JSON string (escaping double quotes, line breaks, and so on) before curl sends it.
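For very large files, passing the HTML as a command-line argument can hit the operating system's argument-length limit. One possible variation, sketched here rather than taken from the AgentQL docs, has jq read the file itself and streams the request body to curl on stdin. It assumes jq 1.6 or later (for `--rawfile`). The function names and the `AGENTQL_API_KEY` environment variable are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: build_payload, query_html_file, and AGENTQL_API_KEY are
# hypothetical names, not part of AgentQL. Requires curl and jq 1.6+.

# Build the JSON request body from a query string and an HTML file path.
# --rawfile loads the file contents as a single JSON string, so jq handles
# all escaping and the HTML never appears on the command line.
build_payload() {
  local query="$1" html_file="$2"
  jq -n --arg query "$query" --rawfile html "$html_file" \
    '{query: $query, html: $html}'
}

# Pipe the body to the endpoint used in this post. "--data-binary @-" tells
# curl to read the request body from stdin without modifying it.
query_html_file() {
  local query="$1" html_file="$2"
  build_payload "$query" "$html_file" |
    curl -sS -L 'https://api.agentql.com/v1/query-data' \
      -H 'Content-Type: application/json' \
      -H "X-API-Key: ${AGENTQL_API_KEY:?set AGENTQL_API_KEY first}" \
      --data-binary @-
}
```

With the key exported, a call would look like `query_html_file '{ heading }' blog.html`. Running `build_payload '{ heading }' blog.html` on its own prints the request body, which is handy for checking the JSON before sending anything.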

Get started extracting data with HTML and AgentQL

This feature is available now, with no opt-in or special flag required. Learn more in our guide to getting data from HTML with AgentQL or the REST API Reference.

If you have any questions, join our Discord and we will help you out. We love hearing from you! Find us on X or Bluesky, too!

—The TinyFish team building AgentQL
