DEV Community

The AI Entrepreneur
I Built an API That Turns Any URL Into Structured JSON — Here's How

Most web scrapers give you raw HTML. I wanted clean, structured JSON from any URL — no configuration, no selectors, no parsing. Just pass a URL and get organized data back.

So I built it.

What It Does

The Web Content Extractor API takes any URL and returns structured JSON. It automatically detects the content type:

  • Articles → title, author, date, full text, headings
  • Products → name, price, rating, reviews, SKU, images
  • Recipes → ingredients, instructions, prep time, servings
  • Job postings → title, company, salary, location
  • Events → name, date, location, description
  • Any webpage → metadata, content, links, images

How It Works

  1. Fetch the HTML
  2. Auto-detect content type using Open Graph tags, Schema.org markup, and DOM signals
  3. Score content blocks (Readability-style algorithm) to find the main content
  4. Extract structured data: metadata, headings, images, links, JSON-LD
  5. Return clean JSON
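
To make step 2 concrete, here is a minimal sketch of content-type detection from Open Graph and JSON-LD signals. It is an illustration in Python using only the standard library, not the API's actual code (which runs on Node.js + Cheerio). The names `SignalParser`, `SCHEMA_KINDS`, and `detect_type`, and the labels other than "article", are assumptions made for this example. DOM signals and block scoring are left out.

```python
import json
from html.parser import HTMLParser

# Hypothetical mapping from Schema.org @type values to content kinds.
SCHEMA_KINDS = {
    "Article": "article", "NewsArticle": "article", "BlogPosting": "article",
    "Product": "product", "Recipe": "recipe",
    "JobPosting": "job", "Event": "event",
}


class SignalParser(HTMLParser):
    """Collects the og:type value and JSON-LD @type values from a page."""

    def __init__(self):
        super().__init__()
        self.og_type = None
        self.ld_types = []
        self._ld_buffer = None  # list of text pieces while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:type":
            self.og_type = (attrs.get("content") or "").strip().lower()
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._ld_buffer = []

    def handle_data(self, data):
        if self._ld_buffer is not None:
            self._ld_buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "script" or self._ld_buffer is None:
            return
        raw, self._ld_buffer = "".join(self._ld_buffer), None
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            return  # ignore malformed JSON-LD rather than failing the page
        for node in doc if isinstance(doc, list) else [doc]:
            if isinstance(node, dict):
                types = node.get("@type")
                self.ld_types.extend(types if isinstance(types, list) else [types])


def detect_type(html):
    """Prefer explicit Schema.org markup, then og:type, then a generic fallback."""
    parser = SignalParser()
    parser.feed(html)
    parser.close()
    for schema_type in parser.ld_types:
        if isinstance(schema_type, str) and schema_type in SCHEMA_KINDS:
            return SCHEMA_KINDS[schema_type]
    if parser.og_type in ("article", "product"):
        return parser.og_type
    return "webpage"
```

Checking JSON-LD first is a design choice in this sketch: Schema.org types are more specific than `og:type`, which many sites leave at "website" even on recipe or product pages.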

Quick Example

curl "https://george-the-developer--web-content-extractor-api.apify.actor/extract?url=https://techcrunch.com&token=YOUR_TOKEN"

Example response (illustrative, for an article page):

{
  "url": "https://techcrunch.com/article",
  "type": "article",
  "metadata": {
    "title": "AI Agents Are Reshaping Enterprise Software",
    "author": "Sarah Perez",
    "date": "2026-03-24",
    "siteName": "TechCrunch"
  },
  "content": {
    "text": "The rise of AI agents represents...",
    "headings": [{"level": 2, "text": "What Are AI Agents?"}],
    "wordCount": 2847
  }
}
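
If you would rather call the endpoint from code than from curl, a small client might look like the sketch below. The base URL and the `url` and `token` query parameters come from the curl example. The function names and the `summarize` helper are assumptions for illustration, and the response fields are taken from the sample response rather than from a published schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://george-the-developer--web-content-extractor-api.apify.actor/extract"


def build_extract_url(target_url, token):
    """Build the request URL, percent-encoding the target so its own query string survives."""
    return f"{BASE}?{urlencode({'url': target_url, 'token': token})}"


def extract(target_url, token, timeout=30):
    """Call the endpoint and return the decoded JSON body (performs a network request)."""
    with urlopen(build_extract_url(target_url, token), timeout=timeout) as resp:
        return json.load(resp)


def summarize(result):
    """Pull a few fields out of a response shaped like the sample, tolerating missing keys."""
    meta = result.get("metadata", {})
    content = result.get("content", {})
    return {
        "type": result.get("type"),
        "title": meta.get("title"),
        "word_count": content.get("wordCount"),
        "h2": [h["text"] for h in content.get("headings", []) if h.get("level") == 2],
    }
```

Encoding the target URL matters once it carries its own `?` or `&`; passed unencoded, those characters would be read as parameters of the API request instead.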

Why I Built This

I maintain 35+ data extraction APIs on Apify. The most common request I get: "I just need the main content from this URL as JSON."

Anyone building RAG pipelines, content aggregators, or AI agents runs into this need sooner or later. But the existing options tend to be one of:

  • Too expensive (Diffbot = $299/month)
  • Too complex (custom scrapers = days of work)
  • Too slow (most take 5-30 seconds)

This does it in 1-3 seconds for $0.003 per extraction.
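
For a rough sense of scale, here is a back-of-the-envelope calculation using only the two prices quoted above. It treats the $299/month figure as a flat fee and ignores plan limits, free tiers, and feature differences, so take it as arithmetic rather than a full comparison. The function names are made up for this example.

```python
from decimal import Decimal

PRICE_PER_EXTRACTION = Decimal("0.003")  # per-extraction price quoted above
FLAT_MONTHLY = Decimal("299")            # flat monthly figure quoted above


def monthly_cost(extractions):
    """Cost in dollars of a given number of extractions at the per-extraction price."""
    return PRICE_PER_EXTRACTION * extractions


def break_even_extractions():
    """Largest whole number of extractions that still costs no more than the flat fee."""
    return int(FLAT_MONTHLY // PRICE_PER_EXTRACTION)
```

At 10,000 extractions a month the cost is $30, and the two prices meet at roughly 99,666 extractions a month.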

Use Cases

  1. RAG Pipelines — Feed clean text into vector databases for AI retrieval
  2. News Aggregation — Pull articles from 100+ sources into structured data
  3. Competitive Intelligence — Monitor competitor product pages
  4. Content Repurposing — Extract blog posts to repurpose across channels
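
For the RAG case, the extracted text usually needs to be split into overlapping chunks before it is embedded. The sketch below assumes a response shaped like the sample earlier in the post. The chunk sizes, the record layout, and the function names are illustrative choices, and the embedding and vector database steps are left out.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def to_records(result, size=200, overlap=40):
    """Turn one extraction result into chunk records ready for embedding."""
    meta = result.get("metadata", {})
    text = result.get("content", {}).get("text", "")
    return [
        {"id": f"{result.get('url')}#{i}", "title": meta.get("title"), "text": chunk}
        for i, chunk in enumerate(chunk_words(text, size, overlap))
    ]
```

Keeping the source URL in each record ID makes it easy to trace a retrieved chunk back to the page it came from.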

Batch Processing

Need multiple URLs? The /batch endpoint handles up to 25 URLs in parallel:

POST /batch
{
  "urls": ["https://url1.com", "https://url2.com"],
  "format": "article"
}
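
If you have more than 25 URLs, you will need to split them across several requests. Here is a minimal sketch that builds the request bodies. The `urls` and `format` fields mirror the example above, while the helper name is an assumption. Sending the requests is left out because the full /batch URL and its authentication are not shown here.

```python
import json

MAX_BATCH = 25  # per-request limit stated above


def batch_payloads(urls, fmt="article"):
    """Return one JSON request body per group of at most MAX_BATCH URLs."""
    return [
        json.dumps({"urls": urls[i:i + MAX_BATCH], "format": fmt})
        for i in range(0, len(urls), MAX_BATCH)
    ]
```

Each returned string can be used as-is for the body of one POST /batch request.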

Try It

  • Apify Store: Web Content Extractor API
  • Also on RapidAPI: Search "Web Content Extractor"
  • Pricing: $0.003/extraction (pay-per-event, so you only pay for what you use)

Built with Node.js + Cheerio on Apify Standby (instant HTTP response, no queue).

Questions? Hit me up @ai_in_it on X.
