🚀 Stop Writing Scrapers — I Built a Web Data Extractor API with Puppeteer (Full Code)

Scraping websites is one of the most annoying things in development.

❌ Every site has different HTML
❌ JavaScript-heavy pages break your scraper
❌ You get blocked randomly
❌ You waste hours fixing selectors

So I decided to solve this once and for all 👇

🔥 What I Built

I built an AI Web Data Extractor API using:

Node.js
Puppeteer
Axios + Cheerio

👉 It extracts structured data from almost any public URL:

🛒 Product data (title, price, image)
📧 Emails from pages
📰 Articles (title, content, author)

And the best part:

It tries a fast static fetch first and automatically falls back to a headless browser when that result comes back incomplete.

⚡ Live API

👉 Try it here:
https://rapidapi.com/kushanherath59/api/ai-web-data-extractor-api

🧠 How It Works
Step 1 — Try Fast Mode (Axios + Cheerio)
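import axios from "axios";
import * as cheerio from "cheerio"; // cheerio 1.x has no default export

// Fetch the raw HTML without a browser and return a Cheerio handle for querying.
export async function fetchStatic(url) {
  const res = await axios.get(url, {
    headers: {
      "User-Agent": "Mozilla/5.0"
    },
    timeout: 10000 // fail fast so a slow site does not hang the request
  });

  return cheerio.load(res.data);
}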

👉 This is fast ⚡ and cheap: one HTTP request, no browser to launch.

Step 2 — Fallback to Puppeteer (for JS-heavy sites)
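import puppeteer from "puppeteer";

// Render the page in headless Chrome and return the final HTML.
export async function fetchBrowser(url) {
  const browser = await puppeteer.launch({
    args: ["--no-sandbox"], // often needed in containers; it disables Chrome's sandbox
    headless: true // some older Puppeteer releases used "new" here
  });

  try {
    const page = await browser.newPage();
    // networkidle2 gives client-side rendering time to finish;
    // domcontentloaded fires before most JS-rendered content exists.
    await page.goto(url, { waitUntil: "networkidle2", timeout: 30000 });
    return await page.content();
  } finally {
    await browser.close(); // always release the browser, even if navigation fails
  }
}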

👉 This handles pages that render their content with JavaScript, such as:

AliExpress
Amazon
modern React apps

Step 3 — Smart Extraction

Example: product extractor

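// Heuristic product extractor: these generic selectors will not match every site.
export function extractProduct($) {
  return {
    title: $("h1").first().text().trim(),
    price: $(".price, .product-price").first().text().trim(),
    // Prefer the Open Graph image; the first <img> on a page is often a logo.
    image: $('meta[property="og:image"]').attr("content") || $("img").first().attr("src")
  };
}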
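
The email extractor can follow the same idea. Here is a minimal sketch of how that could look, not the exact code behind the API: `extractEmails` and `EMAIL_RE` are names made up for this example, and it works on a raw HTML string (for instance, what `fetchBrowser` returns) so it needs no extra dependencies.

```javascript
// Hypothetical helper, not part of the original API code.
// Deliberately loose pattern: local part, "@", then a dotted domain.
const EMAIL_RE = /[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}/gi;

function extractEmails(html) {
  const found = new Set();

  // mailto: links are the most reliable signal, so collect them first.
  // The capture stops at "?" so query parts like ?subject=... are dropped.
  for (const m of html.matchAll(/href=["']mailto:([^"'?]+)/gi)) {
    found.add(m[1].trim().toLowerCase());
  }

  // Then scan the visible text. Tags are stripped first so that
  // attribute values (image paths, tracking ids) are not matched.
  const text = html.replace(/<[^>]*>/g, " ");
  for (const m of text.matchAll(EMAIL_RE)) {
    found.add(m[0].toLowerCase());
  }

  // A Set removes duplicates and keeps first-seen order.
  return [...found];
}
```

The regex is loose on purpose, so expect occasional false positives (something like `logo@2x.png` appearing in visible text would match) and filter further if you need clean lead lists.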
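
Step 4 — Auto Fallback Logic
// A result without a title or a price is treated as incomplete.
function isWeak(data) {
  return !data.title || !data.price;
}

// Wrapper name is illustrative. It reuses fetchStatic, fetchBrowser,
// extractProduct and the cheerio import from the steps above.
export async function extractProductFrom(url) {
  let result = extractProduct(await fetchStatic(url));

  if (isWeak(result)) {
    const html = await fetchBrowser(url);
    const $ = cheerio.load(html);
    result = extractProduct($);
  }
  return result;
}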

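👉 This is the secret sauce: most pages stay on the cheap static path, and only pages missing a title or price pay for a full browser render.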
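
One gap in the snippet above: if the static request itself fails (a 403, a timeout), nothing falls back. Here is a rough sketch of one way to cover that case. It is a variant built on assumptions, not the API's actual code: `extractWithFallback`, the `fetchers` object, and the `mode` field are hypothetical, and both fetchers are assumed to resolve to an HTML string.

```javascript
// Same rule as isWeak above, plus a guard for a missing result.
function isWeak(data) {
  return !data || !data.title || !data.price;
}

// fetchers.static and fetchers.browser are assumed to resolve to an HTML string,
// and extract turns that HTML into a result object. All three are passed in,
// so the control flow can be exercised with stubs and no network access.
async function extractWithFallback(url, fetchers, extract) {
  let result = null;

  try {
    result = extract(await fetchers.static(url));
  } catch {
    // A blocked or failed static request counts as a weak result.
    result = null;
  }

  if (!isWeak(result)) {
    return { ...result, mode: "static" };
  }

  // Only now pay for a full browser render.
  result = extract(await fetchers.browser(url));
  return { ...result, mode: "browser" };
}
```

Returning which path produced the data (`mode`) is optional, but it makes it easy to measure how often the expensive browser path is actually needed.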

🧪 Example API Request
curl -X POST https://your-api-url/api/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "type": "product"
  }'

📦 Example Response
{
  "title": "Wireless Headphones",
  "price": 19.99,
  "currency": "USD",
  "image": "https://..."
}

💡 Why This Is Powerful

Instead of writing scrapers like this:

document.querySelector(".price")

You just do:

POST /api/v1/extract

And get clean JSON ✅
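
From Node.js, that call could look like the sketch below. It assumes Node 18+ for the built-in `fetch` and reuses the placeholder URL from the curl example; `extractFromApi` and the `fetchImpl` parameter are names invented for this example. Going through RapidAPI would also require its auth headers, which are left out here.

```javascript
// Placeholder endpoint copied from the curl example; replace it with your own base URL.
const API_URL = "https://your-api-url/api/v1/extract";

// fetchImpl defaults to the global fetch (Node 18+); pass a stub to test offline.
async function extractFromApi(url, type, fetchImpl = fetch) {
  const res = await fetchImpl(API_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, type }),
  });

  // Surface HTTP errors instead of trying to parse an error page as data.
  if (!res.ok) {
    throw new Error(`Extract request failed with status ${res.status}`);
  }

  return res.json();
}
```

Usage would be something like `const product = await extractFromApi("https://example.com", "product");`.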

🔥 Real Use Cases
SaaS builders
price tracking tools
lead generation systems
market research tools
AI pipelines

⚠️ Challenges I Faced
Dynamic rendering (fixed with Puppeteer)
Messy price formats (regex cleaning)
Anti-bot protection (still improving)
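
Price cleanup along these lines can be sketched as follows. This is a simplified illustration, not the API's production code: `parsePrice` and the `CURRENCY_SYMBOLS` table are hypothetical, and the separator rule (a last `.` or `,` followed by one or two digits is the decimal point) is a heuristic that will misread some formats.

```javascript
// Hypothetical symbol table; extend it for the currencies you care about.
const CURRENCY_SYMBOLS = { "$": "USD", "€": "EUR", "£": "GBP" };

function parsePrice(raw) {
  const text = String(raw ?? "").trim();

  // Grab the first run of digits with optional . and , separators.
  const match = text.match(/\d[\d.,]*/);
  if (!match) return null;

  // Drop punctuation left over from the sentence, e.g. "19.99."
  const digits = match[0].replace(/[.,]+$/, "");
  const decimalAt = Math.max(digits.lastIndexOf("."), digits.lastIndexOf(","));

  let amount;
  if (decimalAt !== -1 && digits.length - decimalAt - 1 <= 2) {
    // Last separator is followed by 1 or 2 digits: treat it as the decimal point.
    const whole = digits.slice(0, decimalAt).replace(/[.,]/g, "");
    const fraction = digits.slice(decimalAt + 1);
    amount = Number(`${whole}.${fraction}`);
  } else {
    // Otherwise every separator is a thousands separator.
    amount = Number(digits.replace(/[.,]/g, ""));
  }

  const symbol = Object.keys(CURRENCY_SYMBOLS).find((s) => text.includes(s));
  return { amount, currency: symbol ? CURRENCY_SYMBOLS[symbol] : null };
}
```

With this sketch, `parsePrice("$19.99")` gives `{ amount: 19.99, currency: "USD" }` and `parsePrice("1.299,50 €")` gives `{ amount: 1299.5, currency: "EUR" }`. Text with no digits, such as "Free", returns `null`.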

🚀 What’s Next

I’m planning to add:

Proxy rotation
CAPTCHA bypass
AI extraction (LLM-based parsing)
Smart page classification

🙌 Final Thoughts

If you’ve ever built a scraper, you know:

It’s not fun 😅

This API makes it:

simple
fast
scalable

🔗 Try It + Feedback

👉 https://rapidapi.com/kushanherath59/api/ai-web-data-extractor-api

Would love your feedback 🙌
What feature should I add next?