Webpage Readable Content Extraction API by GuGuData: Extract Key Information from Webpages
GuGuData's Webpage Readable Content Extraction API is an intelligent tool that allows you to extract the essential elements of an article from any webpage. Whether you're processing HTML content directly or extracting information from a URL, this API provides a reliable solution for reading web content in an easily digestible format.
Why Choose GuGuData's Webpage Readable Content Extraction API?
Our Webpage Readable Content Extraction API comes with advanced features that make it the perfect choice for extracting readable web content:
1. Intelligent Content Extraction
Using machine learning techniques, our API identifies and extracts the key elements of a webpage, such as the article title, author, and publication time.
2. Supports HTML and URL Input
You have the flexibility to provide either the raw HTML content or the URL of the webpage. This flexibility ensures the API can handle a wide variety of use cases, including dynamic or static websites.
3. Extract Various Article Elements
Our API extracts key elements such as the article title, byline (author), text direction, language, content (with or without HTML tags), the article length, excerpt, site name, and publication time.
4. High-Concurrency and Fast Response
With parsing that completes within seconds, this API is designed for high-concurrency environments and keeps responses fast even when processing large volumes of requests.
5. Nationwide CDN Deployment
Our API is deployed across multiple nodes nationwide, ensuring fast and reliable access with minimal latency.
Key Features
- Intelligent Content Extraction: Automatically extract key elements from webpages.
- Supports HTML and URLs: Input either raw HTML or a webpage URL for content extraction.
- Multiple Article Elements: Extract the title, byline, content, and more.
- Rapid Response Time: High-performance parsing with support for high-concurrency use cases.
- Nationwide CDN: Fast and reliable access to the API through a multi-node CDN deployment.
- HTTPS and TLS Support: Secure data transmission with full HTTPS and TLS support.
- Apple ATS Compatible: Fully compatible with Apple's App Transport Security.
- Load Balancing: Optimized performance through multi-server load balancing.
API Documentation
The Webpage Readable Content Extraction API is easy to use and integrates seamlessly with your existing applications. Below are the details on how to implement the API.
API Endpoint
To extract readable content from a webpage, make a POST request to the following endpoint:
POST https://api.gugudata.io/v1/websitetools/readability
Content-Type: application/json; charset=utf-8
For testing purposes, you can use our demo endpoint:
https://api.gugudata.io/v1/websitetools/readability/demo
Request Parameters
Parameter Name | Type | Required | Default Value | Description |
---|---|---|---|---|
appkey | string | Yes | YOUR_APPKEY | The APPKEY obtained after registration. |
html | string | No | YOUR_VALUE | The raw HTML content of the webpage to extract. Either this or url must be provided. |
url | string | No | YOUR_VALUE | The URL of the webpage to extract. Either this or html must be provided. Note: Pages with anti-crawling or access restrictions may not work properly if they block content access. |
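The "either html or url" rule in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not an official client: `build_payload` is a hypothetical helper name, and because the documentation does not say what happens when both `html` and `url` are supplied, the sketch simply passes through whichever values are given.

```python
from typing import Optional


def build_payload(appkey: str,
                  html: Optional[str] = None,
                  url: Optional[str] = None) -> dict:
    """Build the JSON body for the readability endpoint.

    Hypothetical helper: it only enforces the rules stated in the
    request parameter table (appkey required, html or url required).
    """
    if not appkey:
        raise ValueError("appkey is required")
    if not html and not url:
        raise ValueError("provide either html or url")

    payload = {"appkey": appkey}
    # Include only the inputs that were actually supplied.
    if html:
        payload["html"] = html
    if url:
        payload["url"] = url
    return payload
```

Failing early on a missing parameter avoids spending a request on a call that the error table says would come back as a 400 parameter error.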
Sample Request
{
"appkey": "YOUR_APPKEY",
"html": "<html><body><h1>Article Title</h1><p>This is the content of the article.</p></body></html>"
}
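As an illustration, the sample request could be sent from Python using only the standard library. This is a sketch under assumptions: the function names `build_request` and `extract` are made up for this example, the 30-second timeout is an arbitrary choice, and the APPKEY is passed in the JSON body exactly as the sample request shows.

```python
import json
import urllib.request

# Endpoint and Content-Type are taken from the API Endpoint section.
ENDPOINT = "https://api.gugudata.io/v1/websitetools/readability"


def build_request(payload: dict, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Wrap a payload dict in a POST request with a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def extract(payload: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (performs network I/O)."""
    with urllib.request.urlopen(build_request(payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call would then look like `extract({"appkey": "YOUR_APPKEY", "url": "https://example.com/article"})`. Passing the demo endpoint as the `endpoint` argument of `build_request` is one way to try the request shape before switching to the production URL.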
Response Parameters
Parameter Name | Type | Description |
---|---|---|
DataStatus.StatusCode | int | API response status code. |
DataStatus.StatusDescription | string | API response status description. |
DataStatus.ResponseDateTime | string | Timestamp of the response. |
DataStatus.DataTotalCount | int | Total data count, generally used for pagination calculations. |
Data.Title | string | The article title. |
Data.Byline | string | The article author or byline. |
Data.Dir | string | The text direction of the article (LTR or RTL). |
Data.Lang | string | The language of the article. |
Data.Content | string | The full content of the article in HTML format. |
Data.TextContent | string | The plain text content of the article (without HTML tags). |
Data.Length | int | The length of the article content. |
Data.Excerpt | string | A short excerpt from the article. |
Data.SiteName | string | The name of the website where the article is published. |
Data.PublishedTime | string | The published time of the article, if available. |
Sample Response
{
"DataStatus": {
"StatusCode": 200,
"StatusDescription": "Normal return",
"ResponseDateTime": "2021-05-13T00:00:00Z",
"DataTotalCount": 1
},
"Data": {
"Title": "Sample Article",
"Byline": "Author Name",
"Dir": "LTR",
"Lang": "en",
"Content": "<h1>Sample Article</h1><p>This is the article content.</p>",
"TextContent": "Sample Article\nThis is the article content.",
"Length": 45,
"Excerpt": "This is the article content.",
"SiteName": "Sample Website",
"PublishedTime": ["2021-05-13T12:00:00Z"]
}
}
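Once decoded, the response can be mapped onto a small typed object so the rest of an application does not depend on the raw key names. The sketch below is one possible approach: `Article` and `parse_article` are hypothetical names, and only a subset of the documented fields is mapped. Note that the response table lists `PublishedTime` as a string while the sample response shows it inside an array, so the sketch accepts both shapes rather than assuming one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    title: Optional[str]
    byline: Optional[str]
    lang: Optional[str]
    text: Optional[str]
    excerpt: Optional[str]
    site_name: Optional[str]
    published_time: Optional[str]


def _first_string(value):
    """Return a list's first item (or None if empty); pass other values through."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def parse_article(response: dict) -> Article:
    """Map the documented Data.* fields onto an Article; missing keys become None."""
    data = response.get("Data") or {}
    return Article(
        title=data.get("Title"),
        byline=data.get("Byline"),
        lang=data.get("Lang"),
        text=data.get("TextContent"),
        excerpt=data.get("Excerpt"),
        site_name=data.get("SiteName"),
        published_time=_first_string(data.get("PublishedTime")),
    )
```

Using `.get()` throughout means a page without a byline or publication time still parses, which matches the "if available" wording in the table.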
API Error Codes
Error Code | Description | Remarks |
---|---|---|
200 | Normal return | Successful API call. |
400 | Parameter error | Check if all required parameters are provided. |
429 | Request frequency limited | Cannot exceed 100 requests per second. |
403 | Account in arrears | Account has expired; please renew your subscription. |
402 | APPKEY error | Verify that the APPKEY is correct and active. |
500 | API response error | General API error; contact support if it persists. |
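The error table can be turned into a small check that runs on every decoded response. This is a sketch under assumptions: it reads the code from `DataStatus.StatusCode` in the response body (the documentation does not say whether the HTTP status line carries the same value), `ReadabilityApiError` and `ensure_success` are hypothetical names, and treating 429 and 500 as retryable is a design choice of this example, not something the API documents.

```python
# Remarks copied from the API Error Codes table.
ERROR_HINTS = {
    400: "Check if all required parameters are provided.",
    429: "Cannot exceed 100 requests per second.",
    403: "Account has expired; please renew your subscription.",
    402: "Verify that the APPKEY is correct and active.",
    500: "General API error; contact support if it persists.",
}

# Assumption: rate limiting and server errors may succeed on a later attempt.
RETRYABLE_CODES = {429, 500}


class ReadabilityApiError(Exception):
    def __init__(self, code, description, hint):
        super().__init__(f"{code}: {description} ({hint})")
        self.code = code
        self.retryable = code in RETRYABLE_CODES


def ensure_success(response: dict) -> dict:
    """Return the response unchanged on code 200, otherwise raise."""
    status = response.get("DataStatus") or {}
    code = status.get("StatusCode")
    if code == 200:
        return response
    raise ReadabilityApiError(
        code,
        status.get("StatusDescription", "Unknown error"),
        ERROR_HINTS.get(code, "Undocumented status code."),
    )
```

A caller can catch `ReadabilityApiError` and inspect `retryable` to decide whether to wait and try again or to surface the problem (for example, an invalid APPKEY) immediately.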
How to Get Started
1. Obtain Your APPKEY: Register at GuGuData to obtain your APPKEY for API authentication.
2. Choose Input Type: Decide whether you will extract content using the raw HTML of the webpage or the webpage URL.
3. Make a POST Request: Send a POST request to the API endpoint with the required parameters.
4. Retrieve Article Elements: Receive the extracted elements such as the title, content, and author.
5. Integrate and Automate: Incorporate the API into your applications or workflows for automated content extraction.
Use Cases
- Content Scraping: Extract readable content from websites for analysis or republishing.
- Article Summarization: Extract the key elements of articles for automated summarization.
- SEO and Metadata Analysis: Retrieve key content from web pages for SEO or metadata analysis.
- Archiving: Store web content in an easily digestible format for archival purposes.
Conclusion
GuGuData's Webpage Readable Content Extraction API is a powerful tool for extracting the essential elements of articles from any webpage. With support for both HTML and URL input, high-performance parsing, and the ability to retrieve a variety of article elements, this API is the perfect solution for anyone looking to make sense of web content.
Get started with GuGuData's Webpage Readable Content Extraction API today and start extracting key information from web pages with ease.