Webpage Readable Content Extraction API by GuGuData: Extract Key Information from Webpages

GuGuData's Webpage Readable Content Extraction API is an intelligent tool that allows you to extract the essential elements of an article from any webpage. Whether you're processing HTML content directly or extracting information from a URL, this API provides a reliable solution for reading web content in an easily digestible format.

Why Choose GuGuData's Webpage Readable Content Extraction API?

Our Webpage Readable Content Extraction API comes with advanced features that make it the perfect choice for extracting readable web content:

1. Intelligent Content Extraction

Using machine learning techniques, our API intelligently extracts key elements from webpages, such as the article title, author, publication time, and much more.

2. Supports HTML and URL Input

You have the flexibility to provide either the raw HTML content or the URL of the webpage. This flexibility ensures the API can handle a wide variety of use cases, including dynamic or static websites.

3. Extract Various Article Elements

Our API extracts key elements such as the article title, byline (author), text direction, language, content (with or without HTML tags), the article length, excerpt, site name, and publication time.

4. High-Concurrency and Fast Response

Pages are typically parsed within seconds, and the API is designed for high-concurrency environments, so responses stay fast even when processing large volumes of requests.

5. Nationwide CDN Deployment

Our API is deployed across multiple nodes nationwide, ensuring fast and reliable access with minimal latency.

Key Features

Intelligent Content Extraction: Automatically extract key elements from webpages.
Supports HTML and URLs: Input either raw HTML or a webpage URL for content extraction.
Multiple Article Elements: Extract the title, byline, content, and more.
Rapid Response Time: High-performance parsing with support for high-concurrency use cases.
Nationwide CDN: Fast and reliable access to the API through a multi-node CDN deployment.
HTTPS and TLS Support: Secure data transmission with full HTTPS and TLS support.
Apple ATS Compatible: Fully compatible with Apple's App Transport Security.
Load Balancing: Optimized performance through multi-server load balancing.

API Documentation

The Webpage Readable Content Extraction API is easy to use and integrates seamlessly with your existing applications. Below are the details on how to implement the API.

API Endpoint

To extract readable content from a webpage, make a POST request to the following endpoint:

POST https://api.gugudata.io/v1/websitetools/readability
Content-Type: application/json; charset=utf-8

For testing purposes, you can use our demo endpoint:

https://api.gugudata.io/v1/websitetools/readability/demo

Request Parameters

Parameter Name	Type	Required	Default Value	Description
`appkey`	string	Yes	YOUR_APPKEY	The APPKEY obtained after registration.
`html`	string	No	YOUR_VALUE	The raw HTML content of the webpage to extract. Either this or `url` must be provided.
`url`	string	No	YOUR_VALUE	The URL of the webpage to extract. Either this or `html` must be provided. Note: Pages with anti-crawling or access restrictions may not work properly if they block content access.
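
The "either `html` or `url`" rule in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not official client code: the helper name `build_payload` is hypothetical, and because the documentation does not say how the API treats a request that carries both fields, the sketch simply passes both through if both are given.

```python
def build_payload(appkey, html=None, url=None):
    """Build the JSON body for the readability endpoint.

    Hypothetical helper: enforces the documented rule that at least
    one of `html` or `url` must be provided alongside the APPKEY.
    """
    if not appkey:
        raise ValueError("appkey is required")
    if not html and not url:
        raise ValueError("either html or url must be provided")

    payload = {"appkey": appkey}
    # Only include the fields the caller actually supplied.
    if html:
        payload["html"] = html
    if url:
        payload["url"] = url
    return payload
```

Calling `build_payload("YOUR_APPKEY", url="https://example.com")` returns a dictionary that can be serialized to JSON and sent as the request body.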

Sample Request

{
    "appkey": "YOUR_APPKEY",
    "html": "<html><body><h1>Article Title</h1><p>This is the content of the article.</p></body></html>"
}
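
A request like the sample above could be sent from Python using only the standard library. This is a sketch under stated assumptions: the function names `make_request` and `extract` and the 30-second timeout are illustrative choices, not part of the API. Only the endpoint URL, the POST method, and the Content-Type header come from the documentation.

```python
import json
import urllib.request

# Endpoint and Content-Type as given in the documentation.
ENDPOINT = "https://api.gugudata.io/v1/websitetools/readability"


def make_request(payload, endpoint=ENDPOINT):
    """Build (but do not send) a POST request with a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def extract(payload, timeout=30):
    """Send the request and decode the JSON response into a dict."""
    with urllib.request.urlopen(make_request(payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    sample = {
        "appkey": "YOUR_APPKEY",
        "html": "<html><body><h1>Article Title</h1>"
                "<p>This is the content of the article.</p></body></html>",
    }
    print(extract(sample))
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for non-2xx HTTP responses. Whether the API reports its error codes through the HTTP status, the response body, or both is not stated here, so callers may want to handle both cases.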

Response Parameters

Parameter Name	Type	Description
`DataStatus.StatusCode`	int	API response status code.
`DataStatus.StatusDescription`	string	API response status description.
`DataStatus.ResponseDateTime`	string	Timestamp of the response.
`DataStatus.DataTotalCount`	int	Total data count, generally used for pagination calculations.
`Data.Title`	string	The article title.
`Data.Byline`	string	The article author or byline.
`Data.Dir`	string	The text direction of the article (LTR or RTL).
`Data.Lang`	string	The language of the article.
`Data.Content`	string	The full content of the article in HTML format.
`Data.TextContent`	string	The plain text content of the article (without HTML tags).
`Data.Length`	int	The length of the article content.
`Data.Excerpt`	string	A short excerpt from the article.
`Data.SiteName`	string	The name of the website where the article is published.
`Data.PublishedTime`	string	The published time of the article, if available.

Sample Response

{
    "DataStatus": {
        "StatusCode": 200,
        "StatusDescription": "Normal return",
        "ResponseDateTime": "2021-05-13T00:00:00Z",
        "DataTotalCount": 1
    },
    "Data": {
        "Title": "Sample Article",
        "Byline": "Author Name",
        "Dir": "LTR",
        "Lang": "en",
        "Content": "<h1>Sample Article</h1><p>This is the article content.</p>",
        "TextContent": "Sample Article\nThis is the article content.",
        "Length": 45,
        "Excerpt": "This is the article content.",
        "SiteName": "Sample Website",
        "PublishedTime": ["2021-05-13T12:00:00Z"]
    }
}
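
Once decoded, a response shaped like the sample above can be mapped onto a small typed structure. The sketch below is illustrative only: the `Article` class and `parse_article` function are hypothetical names. Because the parameter table describes `Data.PublishedTime` as a string while the sample shows a list, the sketch defensively accepts either form and keeps the first entry of a list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    title: Optional[str]
    byline: Optional[str]
    direction: Optional[str]
    lang: Optional[str]
    html: Optional[str]
    text: Optional[str]
    length: Optional[int]
    excerpt: Optional[str]
    site_name: Optional[str]
    published_time: Optional[str]


def parse_article(response: dict) -> Article:
    """Map the `Data` object of a decoded response onto an Article."""
    data = response.get("Data") or {}

    # Assumption: PublishedTime may arrive as a string or a list of strings.
    published = data.get("PublishedTime")
    if isinstance(published, list):
        published = published[0] if published else None

    return Article(
        title=data.get("Title"),
        byline=data.get("Byline"),
        direction=data.get("Dir"),
        lang=data.get("Lang"),
        html=data.get("Content"),
        text=data.get("TextContent"),
        length=data.get("Length"),
        excerpt=data.get("Excerpt"),
        site_name=data.get("SiteName"),
        published_time=published,
    )
```

Missing fields come back as `None`, so callers can decide for themselves which elements are required for their use case.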

API Error Codes

Error Code	Description	Remarks
200	Normal return	Successful API call.
400	Parameter error	Check if all required parameters are provided.
429	Request frequency limited	Cannot exceed 100 requests per second.
403	Account in arrears	Account has expired; please renew your subscription.
402	APPKEY error	Verify that the APPKEY is correct and active.
500	API response error	General API error; contact support if it persists.
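
The codes in the table above can be turned into a simple client-side check. This is a sketch, not official client behavior: the `ApiError` class and `check_status` function are hypothetical, and treating 429 and 500 as retryable is an assumption based on the remarks column rather than documented guidance.

```python
# Descriptions copied from the error code table.
KNOWN_CODES = {
    200: "Normal return",
    400: "Parameter error",
    429: "Request frequency limited",
    403: "Account in arrears",
    402: "APPKEY error",
    500: "API response error",
}

# Assumption: rate limiting and general server errors may succeed on retry;
# parameter, account, and APPKEY errors will not.
RETRYABLE_CODES = {429, 500}


class ApiError(Exception):
    def __init__(self, code, description, retryable):
        super().__init__(f"{code}: {description}")
        self.code = code
        self.description = description
        self.retryable = retryable


def check_status(response):
    """Raise ApiError unless DataStatus.StatusCode is 200."""
    status = response.get("DataStatus") or {}
    code = status.get("StatusCode")
    if code == 200:
        return
    description = status.get("StatusDescription") or KNOWN_CODES.get(
        code, "Unknown status"
    )
    raise ApiError(code, description, code in RETRYABLE_CODES)
```

A caller might catch `ApiError`, wait briefly and retry when `retryable` is true, and surface the error to the user otherwise.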

How to Get Started

1. Obtain Your APPKEY: Register at GuGuData to obtain your APPKEY for API authentication.
2. Choose Input Type: Decide whether you will extract content using the raw HTML of the webpage or the webpage URL.
3. Make a POST Request: Send a POST request to the API endpoint with the required parameters.
4. Retrieve Article Elements: Receive the extracted elements such as the title, content, and author.
5. Integrate and Automate: Incorporate the API into your applications or workflows for automated content extraction.

Use Cases

Content Scraping: Extract readable content from websites for analysis or republishing.
Article Summarization: Extract the key elements of articles for automated summarization.
SEO and Metadata Analysis: Retrieve key content from web pages for SEO or metadata analysis.
Archiving: Store web content in an easily digestible format for archival purposes.

Conclusion

GuGuData's Webpage Readable Content Extraction API is a powerful tool for extracting the essential elements of articles from any webpage. With support for both HTML and URL input, high-performance parsing, and the ability to retrieve a variety of article elements, this API is the perfect solution for anyone looking to make sense of web content.

Get started with GuGuData's Webpage Readable Content Extraction API today and start extracting key information from web pages with ease.