I Built an API That Turns Any URL Into Structured JSON — Here's How

#api #webdev #ai #javascript

Most web scrapers give you raw HTML. I wanted clean, structured JSON from any URL — no configuration, no selectors, no parsing. Just pass a URL and get organized data back.

So I built it.

What It Does

The Web Content Extractor API takes any URL and returns structured JSON. It automatically detects the content type:

Articles → title, author, date, full text, headings
Products → name, price, rating, reviews, SKU, images
Recipes → ingredients, instructions, prep time, servings
Job postings → title, company, salary, location
Events → name, date, location, description
Any webpage → metadata, content, links, images

How It Works

Fetch the HTML
Auto-detect content type using Open Graph tags, Schema.org markup, and DOM signals
Score content blocks (Readability-style algorithm) to find the main content
Extract structured data: metadata, headings, images, links, JSON-LD
Return clean JSON
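
To make the detection and scoring steps a bit more concrete, here is a minimal sketch of how they could look. It is not the production code: regexes stand in for Cheerio so the example runs without dependencies, and every function name, the type mapping, and the scoring weights are assumptions chosen for illustration. The scoring formula follows the general shape of Readability-style heuristics (reward length and commas, penalize link-heavy blocks).

```javascript
// Sketch only. All names here are hypothetical, and regexes replace Cheerio
// purely to keep the example dependency-free.

// Map Schema.org @type values to the content types listed in this post.
const SCHEMA_TYPE_MAP = new Map([
  ["Article", "article"],
  ["NewsArticle", "article"],
  ["BlogPosting", "article"],
  ["Product", "product"],
  ["Recipe", "recipe"],
  ["JobPosting", "job"],
  ["Event", "event"],
]);

// Collect every top-level @type found in JSON-LD <script> blocks.
function jsonLdTypes(html) {
  const types = [];
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const match of html.matchAll(re)) {
    try {
      const data = JSON.parse(match[1]);
      for (const node of Array.isArray(data) ? data : [data]) {
        types.push(...[].concat(node["@type"] ?? []));
      }
    } catch {
      // Skip malformed JSON-LD instead of failing the whole page.
    }
  }
  return types;
}

// Read the Open Graph type. Assumes `property` appears before `content`.
function ogType(html) {
  const m = html.match(/<meta[^>]*property=["']og:type["'][^>]*content=["']([^"']+)["']/i);
  return m ? m[1].toLowerCase() : null;
}

// Prefer Schema.org markup, fall back to Open Graph, then to a generic page.
function detectContentType(html) {
  for (const type of jsonLdTypes(html)) {
    if (SCHEMA_TYPE_MAP.has(type)) return SCHEMA_TYPE_MAP.get(type);
  }
  const og = ogType(html);
  if (og === "article" || og === "product") return og;
  return "webpage";
}

// Readability-style block score: longer, comma-rich text scores higher,
// and the score shrinks as more of the text sits inside links.
function scoreBlock(text, linkTextLength) {
  const length = text.trim().length;
  if (length < 25) return 0; // too short to be main content
  const commas = (text.match(/,/g) || []).length;
  const linkDensity = Math.min(1, linkTextLength / length);
  const base = 1 + commas + Math.min(Math.floor(length / 100), 3);
  return base * (1 - linkDensity);
}
```

A real pipeline would also look at DOM signals such as class names and nesting depth, which this sketch leaves out.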

Quick Example

curl "https://george-the-developer--web-content-extractor-api.apify.actor/extract?url=https://techcrunch.com/article&token=YOUR_TOKEN"

Response:

{
  "url": "https://techcrunch.com/article",
  "type": "article",
  "metadata": {
    "title": "AI Agents Are Reshaping Enterprise Software",
    "author": "Sarah Perez",
    "date": "2026-03-24",
    "siteName": "TechCrunch"
  },
  "content": {
    "text": "The rise of AI agents represents...",
    "headings": [{"level": 2, "text": "What Are AI Agents?"}],
    "wordCount": 2847
  }
}
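
If you'd rather call it from Node.js than curl, a sketch along these lines should work. The base URL and the `url` and `token` query parameters come straight from the curl example; the function names are made up for illustration, the error handling is an assumption about how failures surface, and `fetch` as a global requires Node 18 or newer.

```javascript
// Base URL and query parameter names are taken from the curl example.
const BASE_URL = "https://george-the-developer--web-content-extractor-api.apify.actor";

// Build the /extract request URL. URLSearchParams percent-encodes the target
// URL, so query strings inside it don't collide with the API's own parameters.
function buildExtractUrl(targetUrl, token) {
  const endpoint = new URL("/extract", BASE_URL);
  endpoint.search = new URLSearchParams({ url: targetUrl, token }).toString();
  return endpoint.toString();
}

// Fetch and parse one extraction (hypothetical helper, Node 18+).
async function extract(targetUrl, token) {
  const res = await fetch(buildExtractUrl(targetUrl, token));
  if (!res.ok) throw new Error(`Extraction failed with HTTP ${res.status}`);
  return res.json();
}

// Reduce a response shaped like the sample above to a one-line summary.
function summarize(result) {
  const title = result.metadata?.title ?? "(untitled)";
  const words = result.content?.wordCount ?? 0;
  return `[${result.type}] ${title} (${words} words)`;
}
```

Usage would look like `summarize(await extract("https://techcrunch.com/article", process.env.TOKEN))`, where `TOKEN` is whatever environment variable holds your API token.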

Why I Built This

I maintain 35+ data extraction APIs on Apify. The most common request I get: "I just need the main content from this URL as JSON."

Developers building RAG pipelines, content aggregators, or AI agents run into this need all the time. But existing solutions tend to be:

Too expensive (Diffbot = $299/month)
Too complex (custom scrapers = days of work)
Too slow (most take 5-30 seconds)

This does it in 1-3 seconds for $0.003 per extraction.
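
To put those two prices side by side, here is a small back-of-the-envelope calculation. It uses only the figures quoted in this post ($0.003 per extraction and the $299/month flat fee) and ignores free tiers, volume discounts, or anything else either pricing page might include. The function names are just for illustration.

```javascript
const PRICE_PER_EXTRACTION = 0.003; // USD, from this post
const FLAT_MONTHLY_FEE = 299;       // USD, the flat-fee figure quoted above

// Monthly cost at per-extraction pricing, formatted to cents.
function monthlyCost(extractions) {
  return (extractions * PRICE_PER_EXTRACTION).toFixed(2);
}

// Largest monthly volume that still costs less than the flat fee.
function breakEvenVolume() {
  return Math.floor(FLAT_MONTHLY_FEE / PRICE_PER_EXTRACTION);
}

console.log(monthlyCost(10000));  // "30.00"
console.log(breakEvenVolume());   // 99666
```

So at 10,000 extractions a month you'd pay about $30, and the per-extraction model stays cheaper until roughly 99,666 extractions a month.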

Use Cases

RAG Pipelines — Feed clean text into vector databases for AI retrieval
News Aggregation — Pull articles from 100+ sources into structured data
Competitive Intelligence — Monitor competitor product pages
Content Repurposing — Extract blog posts to repurpose across channels
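
For the RAG case, the usual next step is splitting the extracted text into chunks before embedding. Here is a rough sketch of that step, assuming a response shaped like the sample earlier in this post. The chunk size, the overlap, and the document shape are illustrative defaults, not anything the API prescribes, and the embedding and vector database calls are left out on purpose.

```javascript
// Split extracted text into overlapping word windows for embedding.
// chunkSize and overlap are illustrative defaults, counted in words.
function chunkText(text, chunkSize = 200, overlap = 40) {
  if (overlap >= chunkSize) throw new RangeError("overlap must be smaller than chunkSize");
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last window reached the end
  }
  return chunks;
}

// Attach source metadata so each chunk can be traced back after retrieval.
// The { id, text, title } shape is hypothetical; adapt it to your vector store.
function toDocuments(result) {
  return chunkText(result.content.text).map((text, index) => ({
    id: `${result.url}#${index}`,
    text,
    title: result.metadata.title,
  }));
}
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of storing some words twice.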

Batch Processing

Need multiple URLs? The /batch endpoint handles up to 25 URLs in parallel:

POST /batch
{
  "urls": ["https://url1.com", "https://url2.com"],
  "format": "article"
}
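
If your list is longer than 25 URLs, you need to split it on the client side. Here is one way that might look. The 25-URL limit and the request body come from the example above; the function names, the `baseUrl` argument, passing the token as a query parameter (carried over from the /extract example), and the shape of the /batch response are all assumptions.

```javascript
const MAX_BATCH_SIZE = 25; // limit stated in this post

// Split a URL list into groups that fit the /batch limit.
function toBatches(urls, size = MAX_BATCH_SIZE) {
  const batches = [];
  for (let i = 0; i < urls.length; i += size) {
    batches.push(urls.slice(i, i + size));
  }
  return batches;
}

// Build fetch options for one /batch call, mirroring the request body above.
function batchRequest(urls, format = "article") {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ urls, format }),
  };
}

// Send batches one after another and collect the raw responses, one per batch.
// Token-as-query-parameter is an assumption based on the /extract example.
async function extractAll(urls, baseUrl, token) {
  const responses = [];
  for (const batch of toBatches(urls)) {
    const endpoint = `${baseUrl}/batch?token=${encodeURIComponent(token)}`;
    const res = await fetch(endpoint, batchRequest(batch));
    if (!res.ok) throw new Error(`Batch failed with HTTP ${res.status}`);
    responses.push(await res.json());
  }
  return responses;
}
```

Sending batches sequentially is the conservative choice here, since each batch is already processed in parallel on the server side.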

Try It

Apify Store: Web Content Extractor API
Also on RapidAPI: Search "Web Content Extractor"
Pricing: $0.003/extraction (pay-per-event, so you only pay for what you use)

Built with Node.js + Cheerio on Apify Standby (instant HTTP response, no queue).

Questions? Hit me up @ai_in_it on X.
