Rachel-Lee Nabors for AgentQL

Originally published at agentql.com

Scrape raw HTML with AgentQL's REST API

Sometimes you need to extract data from HTML but you don't have a URL to pass to AgentQL's REST API. Fret not: our REST API endpoint now supports querying raw HTML directly!

With this functionality, you can scrape data from pages even if you're working behind a firewall, fetching pages with a custom crawler, or integrating with internal tools. Pass the HTML as a string along with your AgentQL query, and AgentQL will return structured data as JSON.

You asked for it: scraping web pages without a URL

You asked if it was possible to scrape data without Playwright. You told us you were already fetching HTML using custom crawlers. We heard you! This new capability is perfect for querying data from:

  • Private and internal network pages
  • Previously crawled pages and HTML dumps
  • Archived HTML files and snapshots

It can even be used to scrape hard-to-reach pages behind heavy anti-bot protection. Navigate to the page using a stealth crawler or your own browser, save the page's HTML or copy it as a string, and follow the steps below!

How to extract data from an HTML string

You can pass HTML directly in your API request like so:

curl -L 'https://api.agentql.com/v1/query-data' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: <YOUR-API-KEY>' \
-d '{
  "html": "<!DOCTYPE html><html><body><h1>Main Page</h1></body></html>",
  "query": "{ heading }"
}'

AgentQL will process the HTML and return structured JSON:

{
  "heading": "Main Page"
}

Got a large, unwieldy chunk of HTML? Or local files you want to send without copy-pasting all the HTML every time? Most HTML will cause JSON formatting errors if you paste it into the request body unescaped, anyway. Try this out:

curl -L 'https://api.agentql.com/v1/query-data' \
-H 'Content-Type: application/json' \
-H 'X-API-Key: <YOUR-API-KEY>' \
-d "$(jq -n \
 --arg html "$(cat blog.html)" \
 '{query: "{ heading }", html: $html}'
)"

This reads the file with cat and uses jq to encode the HTML as a valid JSON string (escaping double quotes, backslashes, newlines, and so on).
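If your HTML files get big, passing the whole page as a command-line argument can run into the operating system's limit on argument size. Here is a minimal sketch of one way around that, assuming jq 1.6 or later: jq reads the file itself via --rawfile, and curl reads the finished body from stdin. The helper name build_payload is made up for this example and is not part of AgentQL.

```shell
#!/usr/bin/env bash
# Sketch: build the request body from a local HTML file without putting
# the HTML on the command line. build_payload is a hypothetical helper.
build_payload() {
  local html_file="$1" query="$2"
  # -n: no input, -c: compact one-line output.
  # --rawfile (jq 1.6+) loads the whole file as one JSON string, so quotes,
  # backslashes, and newlines in the HTML are escaped for us.
  jq -nc --rawfile html "$html_file" --arg query "$query" \
    '{query: $query, html: $html}'
}

# Example call (commented out: it needs network access and a real API key).
# --data-binary @- tells curl to read the request body from stdin.
# build_payload blog.html '{ heading }' |
#   curl -L 'https://api.agentql.com/v1/query-data' \
#     -H 'Content-Type: application/json' \
#     -H 'X-API-Key: <YOUR-API-KEY>' \
#     --data-binary @-
```

To check the body before sending it, run build_payload blog.html '{ heading }' | jq . and confirm that the html field holds your page as a single escaped string.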

Get started extracting data with HTML and AgentQL

This feature is available now, with no opt-in or special flag required. Learn more in our guide to getting data from HTML with AgentQL or the REST API Reference.

If you have any questions, join our Discord and we will help you out. We love hearing from you! You can also find us on X or Bluesky!

—The TinyFish team building AgentQL
