Scraping websites is one of the most annoying things in development.
Every site has different HTML
JavaScript-heavy pages break your scraper
You get blocked randomly
You waste hours fixing selectors
So I decided to solve this once and for all.
What I Built
I built an AI Web Data Extractor API using:
Node.js
Puppeteer
Axios + Cheerio
It extracts structured data from a given URL:
Product data (title, price, image)
Emails from pages
Articles (title, content, author)
And the best part:
It automatically switches between fast scraping and browser scraping.
Live API
Try it here:
https://rapidapi.com/kushanherath59/api/ai-web-data-extractor-api
How It Works
Step 1: Try Fast Mode (Axios + Cheerio)
import axios from "axios";
import * as cheerio from "cheerio";

// Fast path: plain HTTP GET, parsed with Cheerio (no JavaScript execution)
export async function fetchStatic(url) {
  const res = await axios.get(url, {
    headers: {
      "User-Agent": "Mozilla/5.0"
    }
  });
  return cheerio.load(res.data);
}
This is fast and cheap.
Step 2: Fallback to Puppeteer (for JS-heavy sites)
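import puppeteer from "puppeteer";

// Slow path: render the page in headless Chrome and return the final HTML
export async function fetchBrowser(url) {
  const browser = await puppeteer.launch({
    args: ["--no-sandbox"],
    headless: "new"
  });
  try {
    const page = await browser.newPage();
    // Wait for network activity to settle so client-rendered content is in the DOM
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.content();
  } finally {
    // Close the browser even if navigation throws
    await browser.close();
  }
}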
This handles sites like:
AliExpress
Amazon
modern React apps
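Step 3: Smart Extraction
Example: product extractor
export function extractProduct($) {
  return {
    // First <h1> is assumed to be the product title
    title: $("h1").first().text().trim(),
    price: $(".price, .product-price").first().text().trim(),
    // First <img> on the page; may be a logo rather than the product photo
    image: $("img").first().attr("src")
  };
}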
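The same extractor pattern could cover the other data types listed earlier. As a rough sketch, an email extractor might look like the following. The name `extractEmails` and the regex are assumptions for illustration, not code from the project, and the pattern is deliberately simple rather than a full address validator.

```javascript
// Hypothetical helper (not from the original project): pulls email-like
// strings out of page text using a deliberately simple pattern.
const EMAIL_PATTERN =
  /[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}/g;

function extractEmails(text) {
  const matches = text.match(EMAIL_PATTERN) ?? [];
  // Lowercase and de-duplicate while keeping first-seen order
  return [...new Set(matches.map((m) => m.toLowerCase()))];
}
```

With Cheerio, this could be fed the visible page text, for example `extractEmails($("body").text())`.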
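Step 4: Auto Fallback Logic
// A result is "weak" when the fast path missed the title or the price
function isWeak(data) {
  return !data.title || !data.price;
}

// `result` holds the fast-path output of extractProduct() for `url`
if (isWeak(result)) {
  const html = await fetchBrowser(url);
  const $ = cheerio.load(html);
  result = extractProduct($);
}
This is the secret sauce.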
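To make the fallback flow easier to follow end to end, here is a minimal sketch that wires the steps into one function. The name `extractWithFallback`, the `mode` field, and the injected `fetchFast` / `fetchRendered` / `extract` parameters are assumptions for illustration. One detail worth handling: Axios rejects on non-2xx responses by default, so a blocked static request (say, a 403) would throw before `isWeak` is ever checked. The sketch treats that case as a reason to fall back too.

```javascript
// Sketch of the auto-fallback flow with the fetchers passed in as parameters.
// `fetchFast` and `fetchRendered` are hypothetical stand-ins that resolve to
// HTML strings; `extract` maps HTML to a result object.
function isWeak(data) {
  return !data || !data.title || !data.price;
}

async function extractWithFallback(url, { fetchFast, fetchRendered, extract }) {
  let result = null;
  try {
    result = extract(await fetchFast(url));
  } catch (err) {
    // A blocked or failed static request also triggers the fallback
    result = null;
  }
  if (!isWeak(result)) {
    return { ...result, mode: "fast" };
  }
  // Fast path failed or missed key fields: render the page in a browser.
  // The rendered result is returned as-is, even if it is still incomplete.
  const rendered = extract(await fetchRendered(url));
  return { ...rendered, mode: "browser" };
}
```

In the setup described above, `fetchFast` could wrap the Axios call and return `res.data`, `fetchRendered` could be `fetchBrowser`, and `extract` could be `(html) => extractProduct(cheerio.load(html))`.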
Example API Request
curl -X POST https://your-api-url/api/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "type": "product"
  }'
Example Response
{
  "title": "Wireless Headphones",
  "price": 19.99,
  "currency": "USD",
  "image": "https://..."
}
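The same request can be made from Node.js. The sketch below mirrors the curl call using the built-in `fetch` (Node 18+). The function names `buildExtractRequest` and `extractViaApi` are hypothetical, and the base URL is the placeholder from the example above, not a real endpoint. Endpoints hosted on RapidAPI typically also expect its API-key headers, which this sketch leaves out.

```javascript
// Builds the POST request shown in the curl example. `baseUrl` is the
// placeholder host from the article, not a real endpoint.
function buildExtractRequest(baseUrl, targetUrl, type) {
  return {
    endpoint: `${baseUrl}/api/v1/extract`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url: targetUrl, type }),
    },
  };
}

// `fetchImpl` defaults to the global fetch available in Node 18+;
// passing a stub makes the function easy to test offline.
async function extractViaApi(baseUrl, targetUrl, type, fetchImpl = fetch) {
  const { endpoint, options } = buildExtractRequest(baseUrl, targetUrl, type);
  const res = await fetchImpl(endpoint, options);
  if (!res.ok) {
    throw new Error(`Extract request failed with status ${res.status}`);
  }
  return res.json();
}
```

A call would then look like `await extractViaApi("https://your-api-url", "https://example.com", "product")`.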
Why This Is Powerful
Instead of writing scrapers like this:
document.querySelector(".price")
You just do:
POST /api/v1/extract
And get clean JSON.
Real Use Cases
SaaS builders
price tracking tools
lead generation systems
market research tools
AI pipelines
Challenges I Faced
Dynamic rendering (fixed with Puppeteer)
Messy price formats (regex cleaning)
Anti-bot protection (still improving)
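To give a feel for the price cleaning step, here is one way scraped text like "$1,299.99" could be turned into the numeric `price` and `currency` fields shown in the example response. This is a sketch under assumptions: `parsePrice` and the symbol table are hypothetical, and the separator handling is a heuristic. Inputs such as "1.299" (a dot used for thousands) are genuinely ambiguous and would be read here as a decimal.

```javascript
// Hypothetical price cleaner: turns scraped text into a number plus an ISO
// currency code. The symbol table is illustrative, not exhaustive.
const CURRENCY_SYMBOLS = { "$": "USD", "€": "EUR", "£": "GBP" };

function parsePrice(raw) {
  const text = String(raw ?? "").trim();
  // Take the first run of digits with optional . and , separators
  const match = text.match(/\d[\d.,]*/);
  if (!match) return null;
  const digits = match[0].replace(/[.,]+$/, ""); // drop trailing punctuation
  const lastComma = digits.lastIndexOf(",");
  const lastDot = digits.lastIndexOf(".");
  // Treat the comma as a decimal mark only when it comes last and is not
  // followed by exactly three digits (which looks like a thousands group)
  const commaIsDecimal =
    lastComma > lastDot && digits.length - lastComma - 1 !== 3;
  const normalized = commaIsDecimal
    ? digits.slice(0, lastComma).replace(/[.,]/g, "") + "." + digits.slice(lastComma + 1)
    : digits.replace(/,/g, "");
  const amount = Number.parseFloat(normalized);
  if (Number.isNaN(amount)) return null;
  const symbol = Object.keys(CURRENCY_SYMBOLS).find((s) => text.includes(s));
  return { amount, currency: symbol ? CURRENCY_SYMBOLS[symbol] : null };
}
```

Text with no digits at all, such as "Free", yields `null`, which a caller could treat the same way as a missing price.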
What's Next
I'm planning to add:
Proxy rotation
CAPTCHA bypass
AI extraction (LLM-based parsing)
Smart page classification
Final Thoughts
If you've ever built a scraper, you know:
It's not fun.
This API makes it:
simple
fast
scalable
Try It + Feedback
https://rapidapi.com/kushanherath59/api/ai-web-data-extractor-api
Would love your feedback.
What feature should I add next?
Top comments (1)
Built this in 2 days.
Happy to share full repo if people are interested! github.com/Kushan20070126/ai-web-d...