How to Extract Readable Content from Webpages with an API
## 1. What problem are we solving
Extracting readable content from a webpage is challenging because raw HTML includes ads, navigation bars, comments and other extraneous elements.
Developers often need just the main text of an article for tasks like summarization, analysis or offline storage.
Parsing this content manually is error‑prone and time‑consuming, especially when dealing with many different website structures.
This guide shows how to reliably extract clean, readable content using an API.
## 2. When you need this solution
- Building a reading mode for your application
- Creating a content summarization pipeline
- Performing natural language processing on article text
- Saving readable text for offline access
## 3. How the API solution works
Rather than scraping and parsing HTML yourself, a readability API automatically identifies the main content within a webpage.
It strips away navigation menus, advertisements and other unrelated elements to return the core article.
The API delivers structured metadata such as title, author, language and publication time alongside the clean text.
By relying on the API, you can handle different website layouts consistently without constant maintenance.
## 4. Step‑by‑step: Extract readable content using an API
Step 1: Provide the webpage URL or HTML you want to extract content from
Step 2: Send the URL or HTML to the readability API in a POST request
Step 3: Receive a JSON response containing the article’s metadata and clean text
Step 4: Use the extracted content in your application
## 5. API request example
Endpoint
POST https://api.gugudata.io/v1/websitetools/readability
Headers
- Content-Type: application/json
- Authorization: Bearer YOUR_APPKEY
Body
{
  "url": "https://example.com/article"
}
cURL
curl -X POST "https://api.gugudata.io/v1/websitetools/readability" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_APPKEY" \
  -d '{"url": "https://example.com/article"}'
JavaScript
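// Node 18+ provides a global fetch; this require is only needed on older
// versions and works with node-fetch v2 (v3 is ESM-only).
const fetch = require('node-fetch');

async function extractReadable() {
  const response = await fetch('https://api.gugudata.io/v1/websitetools/readability', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer YOUR_APPKEY'
    },
    body: JSON.stringify({ url: 'https://example.com/article' })
  });
  // Stop early on HTTP-level failures instead of parsing an error body
  if (!response.ok) {
    throw new Error(`Request failed with HTTP ${response.status}`);
  }
  const result = await response.json();
  console.log(result.data.TextContent);
}

extractReadable().catch(console.error);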
Python
import requests

def extract_readable():
    url = 'https://api.gugudata.io/v1/websitetools/readability'
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer YOUR_APPKEY'
    }
    payload = {
        'url': 'https://example.com/article'
    }
    # json= serializes the payload; the timeout keeps a slow page from hanging the call
    response = requests.post(url, json=payload, headers=headers, timeout=30)
    response.raise_for_status()  # surface HTTP-level errors early
    data = response.json()
    print(data['data']['TextContent'])

extract_readable()
Example response
{
  "dataStatus": {
    "statusCode": 200,
    "statusDescription": "OK"
  },
  "data": {
    "Title": "Example Article Title",
    "Byline": "Author Name",
    "Lang": "en",
    "Content": "<p>This is the clean HTML of the article...</p>",
    "TextContent": "This is the clean text of the article...",
    "Length": 1234,
    "Excerpt": "This is a short excerpt...",
    "SiteName": "Example Site",
    "PublishedTime": "2025-12-01T12:34:56Z"
  }
}
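A response shaped like the example above can be mapped onto a small typed structure before the rest of the application touches it. The sketch below assumes the field names shown in the example response and that a `statusCode` of 200 inside `dataStatus` signals success. The `Article` class and `parse_article` helper are illustrative names, not part of the API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    title: str
    byline: Optional[str]
    lang: Optional[str]
    text: str
    published_time: Optional[str]


def parse_article(raw: str) -> Article:
    """Turn a raw JSON response body into an Article, or raise ValueError."""
    body = json.loads(raw)
    status = body.get("dataStatus", {})
    # Assumption: statusCode 200 in dataStatus means the extraction succeeded.
    if status.get("statusCode") != 200:
        raise ValueError(f"extraction failed: {status.get('statusDescription')}")
    data = body.get("data") or {}
    return Article(
        title=data.get("Title", ""),
        byline=data.get("Byline"),
        lang=data.get("Lang"),
        text=data.get("TextContent", ""),
        published_time=data.get("PublishedTime"),
    )


# Trimmed copy of the example response, used here instead of a live call.
sample = """{
  "dataStatus": {"statusCode": 200, "statusDescription": "OK"},
  "data": {"Title": "Example Article Title", "Byline": "Author Name", "Lang": "en",
           "TextContent": "This is the clean text of the article...",
           "PublishedTime": "2025-12-01T12:34:56Z"}
}"""
article = parse_article(sample)
print(article.title, "|", article.lang)
```

Optional fields are left as `None` when absent, since not every page exposes an author or a publication time.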
## 6. Why use an API instead of manual scraping
Manually scraping and parsing webpages requires custom logic for every site and ongoing maintenance when layouts change.
An API‑based solution hides this complexity and reliably returns the main content along with useful metadata.
Using the API saves development time and reduces the risk of breaking when websites update their design.
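To make the maintenance cost concrete, here is a minimal sketch of a hand-rolled extractor built on Python's standard `html.parser`. It is not how the API works. It is a deliberately naive heuristic that drops text inside a fixed set of tags, and the `NOISE_TAGS` set and the function names are illustrative choices.

```python
from html.parser import HTMLParser

# Tags whose text is treated as page chrome rather than article content.
NOISE_TAGS = {"nav", "aside", "footer", "header", "script", "style", "form"}


class NaiveTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # > 0 while inside any noise tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def naive_extract(html: str) -> str:
    """Return the text found outside of NOISE_TAGS, joined by spaces."""
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Same article, two layouts: semantic tags versus generic div elements.
tidy = "<nav>Home | About</nav><article><p>Main story text.</p></article><footer>Copyright</footer>"
messy = '<div class="menu">Home | About</div><div><p>Main story text.</p></div>'
tidy_text = naive_extract(tidy)
messy_text = naive_extract(messy)
print(tidy_text)
print(messy_text)
```

The heuristic handles the semantic markup, but the layout that builds its menu from plain `div` elements leaks navigation text into the result. Each such site needs its own special case, which is the kind of upkeep the API is meant to absorb.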
## 7. Frequently Asked Questions
Can this API extract readable content from dynamic or JavaScript‑rendered pages?
Most static pages are supported, but pages that require JavaScript to render content may not return full results.
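Because a JavaScript-rendered page may come back with little or no text, it can help to check the result before using it. The sketch below flags suspiciously short extractions so the caller can fall back to another strategy, such as rendering the page elsewhere and submitting the resulting HTML. The 200-character threshold and the `looks_incomplete` helper are assumptions for illustration; the `TextContent` field name comes from the example response.

```python
def looks_incomplete(data: dict, min_chars: int = 200) -> bool:
    """Return True when the extracted text is missing or shorter than min_chars."""
    text = (data.get("TextContent") or "").strip()
    return len(text) < min_chars


# Stand-ins for the "data" object of two responses.
full = {"TextContent": "word " * 100}
thin = {"TextContent": "Loading..."}
print(looks_incomplete(full), looks_incomplete(thin))
```

A fixed threshold will misjudge genuinely short articles, so treat the flag as a prompt for a second look rather than a verdict.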
Is the extracted content accurate?
The API uses advanced algorithms to identify the main article body and remove noise, providing accurate results in most cases.
Does the API support non‑English websites?
Yes, the readability API can extract content from pages in various languages and will return the language code in the response.
Do I need to send both the HTML and the URL?
No, you can provide either the full HTML of the page or the URL to fetch; the API will process whichever is provided.
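The either/or rule can be enforced on the client before a request is sent. In this sketch the `url` field matches the request example, while the `html` field name is an assumption (the name of the HTML parameter is not given here), so check the API reference before relying on it. `build_payload` is a hypothetical helper, and requiring exactly one input is a client-side choice made for clarity, not a documented API rule.

```python
from typing import Optional


def build_payload(url: Optional[str] = None, html: Optional[str] = None) -> dict:
    """Build a request body from exactly one of url or html."""
    if (url is None) == (html is None):
        raise ValueError("provide exactly one of url or html")
    if url is not None:
        return {"url": url}
    # Assumption: the HTML input is sent under an "html" key.
    return {"html": html}


print(build_payload(url="https://example.com/article"))
```

The returned dictionary can be passed as the `json=` argument of `requests.post` in the Python example.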
## 8. Next steps
If you want to extract readable content from webpages without writing your own parser, you can try a readability API and integrate it into your workflow in just a few minutes. To start, send a POST request to the readability API with your app key.