GuGuData

How to Extract Readable Content from Webpages with an API

1. What problem are we solving

Extracting readable content from a webpage is challenging because raw HTML includes ads, navigation bars, comments and other extraneous elements.

Developers often need just the main text of an article for tasks like summarization, analysis or offline storage.

Parsing this content manually is error‑prone and time‑consuming, especially when dealing with many different website structures.

This guide shows how to reliably extract clean, readable content using an API.

2. When you need this solution

  • Building a reading mode for your application
  • Creating a content summarization pipeline
  • Performing natural language processing on article text
  • Saving readable text for offline access

3. How the API solution works

Rather than scraping and parsing HTML yourself, a readability API automatically identifies the main content within a webpage. It strips away navigation menus, advertisements and other unrelated elements to return the core article.

The API delivers structured metadata such as title, author, language and publication time alongside the clean text.

By relying on the API, you can handle different website layouts consistently without constant maintenance.

4. Step-by-step: Extract readable content using an API

Step 1: Provide the webpage URL or HTML you want to extract content from
Step 2: Send a POST request to the readability API with your app key and the URL or HTML
Step 3: Receive a JSON response containing the article's metadata and clean text
Step 4: Use the extracted content in your application

5. API request example

Endpoint

POST https://api.gugudata.io/v1/websitetools/readability

Headers

  • Content-Type: application/json
  • Authorization: Bearer YOUR_APPKEY

Body
{  
  "url": "https://example.com/article"  
}  

cURL

curl -X POST "https://api.gugudata.io/v1/websitetools/readability" \  
  -H "Content-Type: application/json" \  
  -H "Authorization: Bearer YOUR_APPKEY" \  
  -d '{"url": "https://example.com/article"}'  

JavaScript

// Node 18+ provides a global fetch; node-fetch is only needed on older versions
const fetch = require('node-fetch');  
async function extractReadable() {  
  const response = await fetch('https://api.gugudata.io/v1/websitetools/readability', {  
    method: 'POST',  
    headers: {  
      'Content-Type': 'application/json',  
      'Authorization': 'Bearer YOUR_APPKEY'  
    },  
    body: JSON.stringify({ url: 'https://example.com/article' })  
  });  
  const result = await response.json();  
  console.log(result.data.TextContent);  
}  
extractReadable();  

Python

import requests  
def extract_readable():  
    url = 'https://api.gugudata.io/v1/websitetools/readability'  
    headers = {  
        'Content-Type': 'application/json',  
        'Authorization': 'Bearer YOUR_APPKEY'  
    }  
    payload = {  
        'url': 'https://example.com/article'  
    }  
    response = requests.post(url, json=payload, headers=headers)  
    data = response.json()  
    print(data['data']['TextContent'])  
extract_readable()  

Example response

{  
  "dataStatus": {  
    "statusCode": 200,  
    "statusDescription": "OK"  
  },  
  "data": {  
    "Title": "Example Article Title",  
    "Byline": "Author Name",  
    "Lang": "en",  
    "Content": "<p>This is the clean HTML of the article...</p>",  
    "TextContent": "This is the clean text of the article...",  
    "Length": 1234,  
    "Excerpt": "This is a short excerpt...",  
    "SiteName": "Example Site",  
    "PublishedTime": "2025-12-01T12:34:56Z"  
  }  
}  
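Before using the extracted fields, it is worth checking dataStatus in the response rather than assuming success. Here is a minimal sketch in Python that validates a response shaped like the example above; the helper name parse_readability_response is ours, and the field names come from the sample response.

```python
def parse_readability_response(payload: dict) -> dict:
    """Return the extracted article data, or raise if the API reported an error."""
    status = payload.get("dataStatus", {})
    if status.get("statusCode") != 200:
        raise RuntimeError(f"API error: {status.get('statusDescription', 'unknown')}")
    data = payload.get("data", {})
    if not data.get("TextContent"):
        raise ValueError("Response contained no readable text")
    return data

# Example using the response shape shown above:
sample = {
    "dataStatus": {"statusCode": 200, "statusDescription": "OK"},
    "data": {"Title": "Example Article Title", "TextContent": "This is the clean text..."},
}
article = parse_readability_response(sample)
print(article["Title"])  # prints "Example Article Title"
```

This keeps error handling in one place, so the rest of your pipeline only ever sees validated article data.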

6. Why use an API instead of manual scraping

Manually scraping and parsing webpages requires custom logic for every site and ongoing maintenance when layouts change.

An API‑based solution hides this complexity and reliably returns the main content along with useful metadata.

Using the API saves development time and reduces the risk of breaking when websites update their design.

7. Frequently Asked Questions

Can this API extract readable content from dynamic or JavaScript‑rendered pages?

Most static pages are supported, but pages that require JavaScript to render content may not return full results.

Is the extracted content accurate?

The API uses advanced algorithms to identify the main article body and remove noise, providing accurate results in most cases.

Does the API support non‑English websites?

Yes, the readability API can extract content from pages in various languages and will return the language code in the response.

Do I need to send both the HTML and the URL?

No, you can provide either the full HTML of the page or the URL to fetch; the API will process whichever is provided.
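Since the FAQ above says you can submit either a URL or raw HTML, a small helper can build the right request body from whichever you have. This is a hedged sketch: the "html" field name is an assumption based on the FAQ, so confirm it against the official API reference before relying on it.

```python
import json

def build_readability_body(url=None, html=None):
    """Build the POST body from either a page URL or raw HTML (exactly one).

    Note: the "html" field name is assumed from the FAQ, not confirmed
    by the API documentation shown here.
    """
    if (url is None) == (html is None):
        raise ValueError("Provide exactly one of url or html")
    return json.dumps({"html": html} if html is not None else {"url": url})

print(build_readability_body(url="https://example.com/article"))
# prints {"url": "https://example.com/article"}
```

The resulting body can be sent with any of the cURL, JavaScript or Python requests shown earlier.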

8. Next steps

If you want to extract readable content from webpages without writing your own parser, you can try a readability API and integrate it into your workflow in just a few minutes.
