Most web scrapers give you raw HTML. I wanted clean, structured JSON from any URL — no configuration, no selectors, no parsing. Just pass a URL and get organized data back.
So I built it.
What It Does
The Web Content Extractor API takes any URL and returns structured JSON. It automatically detects the content type:
- Articles → title, author, date, full text, headings
- Products → name, price, rating, reviews, SKU, images
- Recipes → ingredients, instructions, prep time, servings
- Job postings → title, company, salary, location
- Events → name, date, location, description
- Any webpage → metadata, content, links, images
How It Works
- Fetch the HTML
- Auto-detect content type using Open Graph tags, Schema.org markup, and DOM signals
- Score content blocks (Readability-style algorithm) to find the main content
- Extract structured data: metadata, headings, images, links, JSON-LD
- Return clean JSON
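To make the detection step concrete, here is a minimal sketch of how a type could be inferred from Schema.org JSON-LD and Open Graph tags. It is an illustration in standard-library Python, not the service's actual code (which runs on Node.js + Cheerio). The `SCHEMA_TYPES` mapping, the `detect_type` helper, and every type label other than "article" are assumptions made for the example.

```python
import json
from html.parser import HTMLParser

# Hypothetical mapping from Schema.org @type values to content-type labels.
SCHEMA_TYPES = {
    "Article": "article", "NewsArticle": "article", "BlogPosting": "article",
    "Product": "product", "Recipe": "recipe",
    "JobPosting": "job", "Event": "event",
}


class SignalParser(HTMLParser):
    """Collects Open Graph meta tags and JSON-LD blocks from a page."""

    def __init__(self):
        super().__init__()
        self.og = {}        # e.g. {"og:type": "article"}
        self.json_ld = []   # parsed JSON-LD values
        self._in_ld = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("property") or "").startswith("og:"):
            self.og[a["property"]] = a.get("content") or ""
        elif tag == "script" and a.get("type") == "application/ld+json":
            self._in_ld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            self._in_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                pass  # ignore malformed JSON-LD


def detect_type(html: str) -> str:
    parser = SignalParser()
    parser.feed(html)
    # Schema.org markup is the most specific signal, so check it first.
    for block in parser.json_ld:
        for item in (block if isinstance(block, list) else [block]):
            if not isinstance(item, dict):
                continue
            declared = item.get("@type")
            for name in (declared if isinstance(declared, list) else [declared]):
                if isinstance(name, str) and name in SCHEMA_TYPES:
                    return SCHEMA_TYPES[name]
    # Fall back to the Open Graph type, then to a generic label.
    if parser.og.get("og:type") == "article":
        return "article"
    return "webpage"
```

A real detector would also weigh DOM signals (price elements, ingredient lists, and so on) and handle JSON-LD `@graph` wrappers, which this sketch skips.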
Quick Example
curl "https://george-the-developer--web-content-extractor-api.apify.actor/extract?url=https://techcrunch.com/article&token=YOUR_TOKEN"
Response:
{
  "url": "https://techcrunch.com/article",
  "type": "article",
  "metadata": {
    "title": "AI Agents Are Reshaping Enterprise Software",
    "author": "Sarah Perez",
    "date": "2026-03-24",
    "siteName": "TechCrunch"
  },
  "content": {
    "text": "The rise of AI agents represents...",
    "headings": [{"level": 2, "text": "What Are AI Agents?"}],
    "wordCount": 2847
  }
}
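If you'd rather call it from code than curl, a small Python client could look like the sketch below. The endpoint, the `url` and `token` query parameters, and the response fields come from the example above. The helper names (`build_extract_url`, `extract`, `summarize`) are made up for illustration, and percent-encoding the target URL is a precaution for pages whose own URLs contain `?` or `&`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://george-the-developer--web-content-extractor-api.apify.actor"


def build_extract_url(page_url: str, token: str) -> str:
    """Build the /extract request URL, percent-encoding the target URL."""
    return f"{BASE}/extract?{urlencode({'url': page_url, 'token': token})}"


def extract(page_url: str, token: str, timeout: float = 30.0) -> dict:
    """Call the API and return the parsed JSON body (makes a network request)."""
    with urlopen(build_extract_url(page_url, token), timeout=timeout) as resp:
        return json.load(resp)


def summarize(result: dict) -> str:
    """Turn a response shaped like the example above into a one-line summary."""
    meta = result.get("metadata", {})
    words = result.get("content", {}).get("wordCount", 0)
    return f"{meta.get('title', '(untitled)')} by {meta.get('author', 'unknown')} ({words} words)"
```

Usage would be along the lines of `print(summarize(extract("https://techcrunch.com/article", "YOUR_TOKEN")))`.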
Why I Built This
I maintain 35+ data extraction APIs on Apify. The most common request I get: "I just need the main content from this URL as JSON."
Anyone building RAG pipelines, content aggregators, or AI agents runs into this need sooner or later. But existing solutions tend to fall short in one of three ways:
- Too expensive (Diffbot = $299/month)
- Too complex (custom scrapers = days of work)
- Too slow (most take 5-30 seconds)
This does it in 1-3 seconds for $0.003 per extraction.
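To put the pricing in perspective, here is a quick back-of-the-envelope calculation using only the two figures quoted above ($0.003 per extraction versus a $299/month flat plan). It ignores any differences in features or plan limits, so treat it as rough arithmetic and not a full comparison.

```python
from fractions import Fraction

PRICE_PER_EXTRACTION = Fraction("0.003")  # dollars, as quoted above
FLAT_MONTHLY_PLAN = Fraction(299)         # dollars per month, as quoted above


def monthly_cost(extractions: int) -> Fraction:
    """Pay-per-use cost in dollars for a given monthly volume."""
    return extractions * PRICE_PER_EXTRACTION


def break_even_extractions() -> int:
    """Largest monthly volume that still costs no more than the flat plan."""
    return FLAT_MONTHLY_PLAN // PRICE_PER_EXTRACTION


print(float(monthly_cost(10_000)))   # 30.0
print(break_even_extractions())      # 99666
```

In other words, pay-per-use stays cheaper than the flat plan until roughly 100,000 extractions a month.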
Use Cases
- RAG Pipelines — Feed clean text into vector databases for AI retrieval
- News Aggregation — Pull articles from 100+ sources into structured data
- Competitive Intelligence — Monitor competitor product pages
- Content Repurposing — Extract blog posts to repurpose across channels
Batch Processing
Need multiple URLs? The /batch endpoint handles up to 25 URLs in parallel:
POST /batch
{
  "urls": ["https://url1.com", "https://url2.com"],
  "format": "article"
}
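If you have more than 25 URLs, you'll need to split them into several requests. A minimal sketch of that chunking step is below. The payload shape follows the example above, the `batch_payloads` helper is a hypothetical name, and sending each payload (including how the token is passed to /batch) is left out because the post doesn't show it.

```python
from typing import Iterator

MAX_BATCH = 25  # per-request URL limit stated above


def batch_payloads(urls: list[str], fmt: str = "article") -> Iterator[dict]:
    """Yield /batch request bodies, each holding at most MAX_BATCH URLs."""
    for start in range(0, len(urls), MAX_BATCH):
        yield {"urls": urls[start:start + MAX_BATCH], "format": fmt}


urls = [f"https://example.com/page/{i}" for i in range(60)]
for body in batch_payloads(urls):
    print(len(body["urls"]))  # prints 25, 25, 10
```

Each yielded dict can be serialized with `json.dumps` and sent as the body of a `POST /batch` request.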
Try It
- Apify Store: Web Content Extractor API
- Also on RapidAPI: Search "Web Content Extractor"
- Pricing: $0.003/extraction (pay-per-event, so you only pay for what you use)
Built with Node.js + Cheerio on Apify Standby (instant HTTP response, no queue).
Questions? Hit me up @ai_in_it on X.