Web Content Extraction APIs: Turn URLs into Readable Data, JSON, Links, and Screenshots
Many developer workflows start with a URL. The next step is often extracting readable article text, converting a page to Markdown, collecting links, capturing a screenshot, or checking website metadata before storing a record.
The GuGuData website tools APIs provide URL-focused endpoints that turn web pages and domains into structured outputs for products, data pipelines, and internal automation.
API lineup
| Workflow | Method | Endpoint | Detail page |
|---|---|---|---|
| Readable content extraction | POST | /v1/websitetools/readability | Webpage Readable Content Extraction |
| URL to HTML | POST | /v1/websitetools/url2html | Fetch Rendered HTML from URL |
| URL to Markdown | POST | /v1/websitetools/url2markdown | Convert URL to Markdown |
| URL to structured JSON | POST | /v1/websitetools/url2json | Extract Structured JSON from Webpage |
| URL to links | POST | /v1/websitetools/url2links | Extract Links from URL |
| URL screenshot | POST | /v1/websitetools/url2snapshot | Webpage Screenshot Capture |
| URL to image | POST | /v1/websitetools/url2image | Convert URL to Image |
| URL to static file | POST | /v1/websitetools/url2html | URL to Static File |
| Favicon lookup | GET | /v1/websitetools/favicon | Website Favicon Extraction |
| DNS lookup | GET | /v1/websitetools/dns-lookup | Domain DNS Information Query |
| SSL certificate info | GET | /v1/websitetools/sslcertinfo | Domain SSL Certificate Information Parsing |
| WHOIS lookup | GET | /v1/websitetools/whois | Domain WHOIS Information Lookup |
The public OpenAPI JSON is available at https://gugudata.io/assets/openapi/gugudata.openapi.3.1.json.
When to use these APIs
- Build article ingestion pipelines that need readable page content.
- Convert web pages into Markdown for knowledge bases, AI workflows, or archival systems.
- Extract structured JSON from pages using a prompt-driven workflow.
- Capture page screenshots for review, monitoring, or visual records.
- Audit domain metadata such as DNS records, SSL certificates, favicon, or WHOIS data.
- Normalize URL processing behind one server-side integration layer.
Choosing the right endpoint
Use readability extraction when your goal is the main article body. Use URL to Markdown when the output needs to be readable, portable, and friendly to documentation or AI workflows.
Use URL to JSON when the page has fields you want to extract into a structured shape. Use URL to links when you need discovery, crawling, or link inventory.
Use screenshot and image endpoints when the visual state of the page matters. Use DNS, SSL, favicon, and WHOIS endpoints when the domain itself is the target.
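The decision points above can be captured in a small dispatch table. A minimal Python sketch, where the workflow key names are illustrative labels chosen for this example, not part of the API:

```python
# Map an illustrative workflow name to the GuGuData endpoint and HTTP method.
ENDPOINTS = {
    "article_text": ("POST", "/v1/websitetools/readability"),
    "markdown": ("POST", "/v1/websitetools/url2markdown"),
    "structured_json": ("POST", "/v1/websitetools/url2json"),
    "links": ("POST", "/v1/websitetools/url2links"),
    "screenshot": ("POST", "/v1/websitetools/url2snapshot"),
    "dns": ("GET", "/v1/websitetools/dns-lookup"),
    "ssl": ("GET", "/v1/websitetools/sslcertinfo"),
    "whois": ("GET", "/v1/websitetools/whois"),
}

def endpoint_for(workflow: str) -> tuple[str, str]:
    """Return (method, path) for a workflow, failing loudly on unknown names."""
    try:
        return ENDPOINTS[workflow]
    except KeyError:
        raise ValueError(f"unknown workflow: {workflow}") from None
```

Centralizing the mapping this way keeps endpoint paths in one place when the integration layer grows.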
Example request
curl -X POST "https://api.gugudata.io/v1/websitetools/url2json?appkey=YOUR_APPKEY" \
-H "Content-Type: application/json" \
-d '
{
"url": "https://example.com/article",
"prompt": "Extract the article title, author, published date, and a short summary."
}
'
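The same request can be assembled server-side. A minimal Python sketch that builds the request URL and JSON body to match the curl example (the base URL is taken from that example; sending the request is left to your HTTP client of choice):

```python
import json
from urllib.parse import urlencode

API_BASE = "https://api.gugudata.io"  # base URL from the curl example above

def build_url2json_request(appkey: str, page_url: str, prompt: str) -> tuple[str, str]:
    """Assemble the full request URL and JSON body for the url2json endpoint.

    Mirrors the curl example: the appkey travels as a query parameter,
    while the page URL and extraction prompt go in the JSON body.
    """
    query = urlencode({"appkey": appkey})
    request_url = f"{API_BASE}/v1/websitetools/url2json?{query}"
    body = json.dumps({"url": page_url, "prompt": prompt})
    return request_url, body
```

Keeping request construction in one helper makes it easy to log or replay requests without scattering the appkey through the codebase.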
Response handling
Most website tools return standard JSON responses with dataStatus and data. The exact data shape depends on the endpoint.
{
"dataStatus": {
"statusCode": 200,
"status": "SUCCESS",
"statusDescription": "successfully",
"responseDateTime": "2026-04-29T00:00:00Z",
"dataTotalCount": 1,
"requestParameter": ""
},
"data": {
"title": "Example article",
"summary": "Short extracted summary"
}
}
For URL workflows, store the original URL, request time, and endpoint name with the result. This makes retries, audits, and freshness checks easier.
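A minimal Python sketch of both steps, assuming the dataStatus envelope shown above; the record fields are an illustrative storage shape, not an API requirement:

```python
import json
from datetime import datetime, timezone

def unwrap(response_text: str) -> dict:
    """Validate the dataStatus envelope and return the data payload."""
    payload = json.loads(response_text)
    status = payload.get("dataStatus", {})
    if status.get("statusCode") != 200 or status.get("status") != "SUCCESS":
        raise RuntimeError(f"extraction failed: {status}")
    return payload["data"]

def make_record(url: str, endpoint: str, data: dict) -> dict:
    """Attach the original URL, request time, and endpoint name to the result."""
    return {
        "url": url,
        "endpoint": endpoint,
        "requested_at": datetime.now(timezone.utc).isoformat(),
        "result": data,
    }
```

With this metadata stored alongside the payload, a freshness check is just a comparison against requested_at, and a retry can reuse the original url and endpoint fields.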
HTTP status codes
| HTTP status | Meaning | Recommended handling |
|---|---|---|
| 200 | Request processed successfully. | Parse the documented response body for the endpoint result. |
| 400 | Invalid request parameters or request format. | Check URL format, prompt content, and request body structure. |
| 401 | Missing or unknown application key. | Send a valid appkey with the request. |
| 403 | The application key is recognized but access is not allowed. | Check subscription, trial state, and endpoint access. |
| 429 | Request rate or trial usage limit exceeded. | Reduce concurrency or retry after the limit window resets. |
| 500 | Internal service error. | Retry later or contact support if the error persists. |
| 503 | Upstream service unavailable. | Retry later when the dependency is available again. |
Implementation notes
- Validate and normalize URLs before calling extraction endpoints.
- Keep URL processing on the backend so credentials and retry behavior remain controlled.
- Add timeouts around downstream workflows because external pages can be slow or unavailable.
- Store extraction metadata so results can be refreshed without losing the original request context.
- Use demo endpoints for quick checks, then move to authenticated production endpoints.
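The first note, validating and normalizing URLs, can be done with the standard library before any extraction call. A minimal sketch; the normalization choices (lowercased scheme and host, dropped fragment, defaulted path) are one reasonable convention, not a requirement of the API:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(raw: str) -> str:
    """Validate a URL and normalize it before calling extraction endpoints.

    Rejects non-HTTP(S) schemes and URLs without a hostname, lowercases the
    scheme and host, drops the fragment, and defaults an empty path to "/"
    so stored records deduplicate cleanly.
    """
    parts = urlsplit(raw.strip())
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {raw!r}")
    if not parts.hostname:
        raise ValueError(f"missing host: {raw!r}")
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, "")
    )
```

Running every inbound URL through one normalizer also means cached results keyed by URL are not fragmented by trivial casing differences.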
FAQ
Which endpoint should I use for AI ingestion?
Use Convert URL to Markdown when readable text is enough. Use Extract Structured JSON from Webpage when you need specific fields.
Should I crawl large sites directly from these endpoints?
Use controlled queues and rate limits. URL processing depends on external pages, so retries and concurrency limits should be explicit in your own system.
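A controlled queue can be as small as this Python sketch, which enforces a requests-per-second ceiling; real crawlers would add per-domain limits, persistence, and the retry bookkeeping described above:

```python
import time
from collections import deque

class RateLimitedQueue:
    """Drain queued URLs with an explicit requests-per-second ceiling."""

    def __init__(self, per_second: float):
        self.min_interval = 1.0 / per_second
        self.pending: deque[str] = deque()
        self._last = 0.0

    def add(self, url: str) -> None:
        self.pending.append(url)

    def drain(self, handler) -> None:
        """Call handler(url) for each queued URL, pacing calls to the limit."""
        while self.pending:
            wait = self.min_interval - (time.monotonic() - self._last)
            if wait > 0:
                time.sleep(wait)
            self._last = time.monotonic()
            handler(self.pending.popleft())
```

Making the limit an explicit constructor argument keeps crawl pacing visible in your own system rather than implicit in whatever the API tolerates.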
Can I combine metadata checks with content extraction?
Yes. For example, you can check DNS or SSL metadata before extracting page content, then store both the domain-level record and the page-level output.
For more developer APIs, visit GuGuData.