DEV Community

GuGuData

Webpage Readable Content Extraction API by GuGuData: Extract Key Information from Webpages

GuGuData's Webpage Readable Content Extraction API is an intelligent tool that allows you to extract the essential elements of an article from any webpage. Whether you're processing HTML content directly or extracting information from a URL, this API provides a reliable solution for reading web content in an easily digestible format.

Why Choose GuGuData's Webpage Readable Content Extraction API?

Our Webpage Readable Content Extraction API comes with advanced features that make it the perfect choice for extracting readable web content:

1. Intelligent Content Extraction

Using machine learning techniques, our API intelligently extracts key elements from webpages, such as the article title, author, publication time, and much more.

2. Supports HTML and URL Input

You can supply either the raw HTML of a page or its URL, which lets the API cover a wide range of use cases, including both dynamic and static websites.

3. Extract Various Article Elements

Our API extracts key elements such as the article title, byline (author), text direction, language, content (with or without HTML tags), the article length, excerpt, site name, and publication time.

4. High-Concurrency and Fast Response

Pages are parsed within seconds, and the API is designed for high-concurrency environments, so responses stay fast even when processing large volumes of requests.

5. Nationwide CDN Deployment

Our API is deployed across multiple nodes nationwide, ensuring fast and reliable access with minimal latency.


Key Features

  • Intelligent Content Extraction: Automatically extract key elements from webpages.
  • Supports HTML and URLs: Input either raw HTML or a webpage URL for content extraction.
  • Multiple Article Elements: Extract the title, byline, content, and more.
  • Rapid Response Time: High-performance parsing with support for high-concurrency use cases.
  • Nationwide CDN: Fast and reliable access to the API through a multi-node CDN deployment.
  • HTTPS and TLS Support: Secure data transmission with full HTTPS and TLS support.
  • Apple ATS Compatible: Fully compatible with Apple's App Transport Security.
  • Load Balancing: Optimized performance through multi-server load balancing.

API Documentation

The Webpage Readable Content Extraction API is easy to use and integrates seamlessly with your existing applications. Below are the details on how to implement the API.

API Endpoint

To extract readable content from a webpage, make a POST request to the following endpoint:

POST https://api.gugudata.io/v1/websitetools/readability
Content-Type: application/json; charset=utf-8

For testing purposes, you can use our demo endpoint:

https://api.gugudata.io/v1/websitetools/readability/demo

Request Parameters

| Parameter Name | Type | Required | Default Value | Description |
| --- | --- | --- | --- | --- |
| appkey | string | Yes | YOUR_APPKEY | The APPKEY obtained after registration. |
| html | string | No | YOUR_VALUE | The raw HTML content of the webpage to extract. Either this or url must be provided. |
| url | string | No | YOUR_VALUE | The URL of the webpage to extract. Either this or html must be provided. Note: Pages with anti-crawling or access restrictions may not work properly if they block content access. |

Sample Request

{
    "appkey": "YOUR_APPKEY",
    "html": "<html><body><h1>Article Title</h1><p>This is the content of the article.</p></body></html>"
}
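
As a rough illustration, the request above could be assembled in Python with only the standard library. This is a minimal sketch, not official client code: the helper names `build_payload`, `build_request`, and `extract` are hypothetical, and the choice to pass both `html` and `url` through when both are given is an assumption, since the documentation only says that one of them must be provided.

```python
import json
import urllib.request

API_URL = "https://api.gugudata.io/v1/websitetools/readability"


def build_payload(appkey, html=None, url=None):
    """Return the JSON body as a dict, requiring at least one of html/url."""
    if not appkey:
        raise ValueError("appkey is required")
    if html is None and url is None:
        raise ValueError("provide either html or url")
    payload = {"appkey": appkey}
    # Assumption: if both are supplied, both are sent and the API decides.
    if html is not None:
        payload["html"] = html
    if url is not None:
        payload["url"] = url
    return payload


def build_request(payload, endpoint=API_URL):
    """Wrap the payload in a POST request with the documented Content-Type."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def extract(appkey, html=None, url=None, endpoint=API_URL, timeout=30):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_request(build_payload(appkey, html=html, url=url), endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping payload construction separate from the network call makes the validation easy to check offline. To try the demo endpoint, pass it as the `endpoint` argument of `extract`.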

Response Parameters

| Parameter Name | Type | Description |
| --- | --- | --- |
| DataStatus.StatusCode | int | API response status code. |
| DataStatus.StatusDescription | string | API response status description. |
| DataStatus.ResponseDateTime | string | Timestamp of the response. |
| DataStatus.DataTotalCount | int | Total data count, generally used for pagination calculations. |
| Data.Title | string | The article title. |
| Data.Byline | string | The article author or byline. |
| Data.Dir | string | The text direction of the article (LTR or RTL). |
| Data.Lang | string | The language of the article. |
| Data.Content | string | The full content of the article in HTML format. |
| Data.TextContent | string | The plain text content of the article (without HTML tags). |
| Data.Length | int | The length of the article content. |
| Data.Excerpt | string | A short excerpt from the article. |
| Data.SiteName | string | The name of the website where the article is published. |
| Data.PublishedTime | string | The published time of the article, if available. |

Sample Response

{
    "DataStatus": {
        "StatusCode": 200,
        "StatusDescription": "Normal return",
        "ResponseDateTime": "2021-05-13T00:00:00Z",
        "DataTotalCount": 1
    },
    "Data": {
        "Title": "Sample Article",
        "Byline": "Author Name",
        "Dir": "LTR",
        "Lang": "en",
        "Content": "<h1>Sample Article</h1><p>This is the article content.</p>",
        "TextContent": "Sample Article\nThis is the article content.",
        "Length": 45,
        "Excerpt": "This is the article content.",
        "SiteName": "Sample Website",
        "PublishedTime": "2021-05-13T12:00:00Z"
    }
}
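
A response shaped like the sample could be read into a small typed structure. The sketch below is one possible approach, assuming the field names listed under Response Parameters. The `Article` class and `parse_response` function are hypothetical names, and values are passed through exactly as the API returns them.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Article:
    title: Optional[str]
    byline: Optional[str]
    direction: Optional[str]
    lang: Optional[str]
    content: Optional[str]
    text_content: Optional[str]
    length: Optional[int]
    excerpt: Optional[str]
    site_name: Optional[str]
    published_time: Any  # passed through unchanged, may be missing


def parse_response(raw):
    """Parse a raw JSON response string into an Article.

    Raises RuntimeError when DataStatus.StatusCode is not 200.
    """
    body = json.loads(raw)
    status = body.get("DataStatus") or {}
    if status.get("StatusCode") != 200:
        raise RuntimeError(
            f"API returned {status.get('StatusCode')}: "
            f"{status.get('StatusDescription')}"
        )
    data = body.get("Data") or {}
    return Article(
        title=data.get("Title"),
        byline=data.get("Byline"),
        direction=data.get("Dir"),
        lang=data.get("Lang"),
        content=data.get("Content"),
        text_content=data.get("TextContent"),
        length=data.get("Length"),
        excerpt=data.get("Excerpt"),
        site_name=data.get("SiteName"),
        published_time=data.get("PublishedTime"),
    )
```

Using `.get` for every field means a page without, say, a byline yields `None` rather than an exception, which suits metadata that is only present "if available".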

API Error Codes

| Error Code | Description | Remarks |
| --- | --- | --- |
| 200 | Normal return | Successful API call. |
| 400 | Parameter error | Check if all required parameters are provided. |
| 429 | Request frequency limited | Cannot exceed 100 requests per second. |
| 403 | Account in arrears | Account has expired; please renew your subscription. |
| 402 | APPKEY error | Verify that the APPKEY is correct and active. |
| 500 | API response error | General API error; contact support if it persists. |
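
One way to act on these codes is to raise a typed error and retry only the codes that are likely to be transient. The sketch below assumes the code is read from `DataStatus.StatusCode` in the response body. Treating 429 and 500 as retryable is a design assumption, not documented behavior, and `ApiError`, `check_status`, and `call_with_retry` are hypothetical names.

```python
import time

# Descriptions copied from the error code table above.
STATUS_DESCRIPTIONS = {
    200: "Normal return",
    400: "Parameter error",
    402: "APPKEY error",
    403: "Account in arrears",
    429: "Request frequency limited",
    500: "API response error",
}

# Assumption: only rate limiting and general API errors are worth retrying.
RETRYABLE = {429, 500}


class ApiError(Exception):
    def __init__(self, code, description):
        super().__init__(f"{code}: {description}")
        self.code = code
        self.description = description
        self.retryable = code in RETRYABLE


def check_status(body):
    """Return body['Data'] on success, otherwise raise ApiError."""
    status = body.get("DataStatus") or {}
    code = status.get("StatusCode")
    if code == 200:
        return body.get("Data")
    description = status.get("StatusDescription") or STATUS_DESCRIPTIONS.get(
        code, "Unknown status"
    )
    raise ApiError(code, description)


def call_with_retry(send, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send() (which returns a parsed response dict) with exponential backoff."""
    for attempt in range(attempts):
        try:
            return check_status(send())
        except ApiError as error:
            # Give up on non-retryable codes or when attempts are exhausted.
            if not error.retryable or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Here `send` is any zero-argument callable that performs the request and returns the decoded JSON, so the retry logic stays independent of the HTTP library in use.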

How to Get Started

  1. Obtain Your APPKEY: Register at GuGuData to obtain your APPKEY for API authentication.

  2. Choose Input Type: Decide whether you will extract content using the raw HTML of the webpage or the webpage URL.

  3. Make a POST Request: Send a POST request to the API endpoint with the required parameters.

  4. Retrieve Article Elements: Receive the extracted elements such as the title, content, and author.

  5. Integrate and Automate: Incorporate the API into your applications or workflows for automated content extraction.
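
For step 5, automation over many pages could look like the batch loop below, which spaces calls so they stay under the documented limit of 100 requests per second. It is only a sketch: `extract_many` is a hypothetical helper, and `fetch` stands for whatever function you write to call the API for a single URL.

```python
import time


def extract_many(urls, fetch, max_per_second=100,
                 clock=time.monotonic, sleep=time.sleep):
    """Call fetch(url) for each URL, spacing calls to respect a rate limit.

    Returns a dict mapping each URL to its result, or to the exception
    raised for that URL so one failing page does not stop the batch.
    """
    min_interval = 1.0 / max_per_second
    results = {}
    next_allowed = clock()
    for url in urls:
        # Wait until the next request slot opens.
        wait = next_allowed - clock()
        if wait > 0:
            sleep(wait)
        next_allowed = max(clock(), next_allowed) + min_interval
        try:
            results[url] = fetch(url)
        except Exception as error:
            results[url] = error
    return results
```

The `clock` and `sleep` parameters default to the real timers but can be swapped for fakes, which makes the pacing testable without waiting.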


Use Cases

  • Content Scraping: Extract readable content from websites for analysis or republishing.
  • Article Summarization: Extract the key elements of articles for automated summarization.
  • SEO and Metadata Analysis: Retrieve key content from web pages for SEO or metadata analysis.
  • Archiving: Store web content in an easily digestible format for archival purposes.

Conclusion

GuGuData's Webpage Readable Content Extraction API is a powerful tool for extracting the essential elements of articles from any webpage. With support for both HTML and URL input, high-performance parsing, and the ability to retrieve a variety of article elements, this API is the perfect solution for anyone looking to make sense of web content.

Get started with GuGuData's Webpage Readable Content Extraction API today and start extracting key information from web pages with ease.
