DEV Community

Kushan Randika Herath

🚀 Stop Writing Scrapers: I Built a Web Data Extractor API with Puppeteer (Full Code)

Scraping websites is one of the most annoying things in development.

❌ Every site has different HTML
❌ JavaScript-heavy pages break your scraper
❌ You get blocked randomly
❌ You waste hours fixing selectors

So I decided to solve this once and for all πŸ‘‡

🔥 What I Built

I built an AI Web Data Extractor API using:

Node.js
Puppeteer
Axios + Cheerio

👉 It extracts structured data from ANY URL:

🛒 Product data (title, price, image)
📧 Emails from pages
📰 Articles (title, content, author)

And the best part:

It automatically switches between fast scraping and browser scraping.

⚑ Live API

πŸ‘‰ Try it here:
https://rapidapi.com/kushanherath59/api/ai-web-data-extractor-api

🧠 How It Works
Step 1 β€” Try Fast Mode (Axios + Cheerio)
import axios from "axios";
import cheerio from "cheerio";

export async function fetchStatic(url) {
const res = await axios.get(url, {
headers: {
"User-Agent": "Mozilla/5.0"
}
});

return cheerio.load(res.data);
}

👉 This is fast ⚡ and cheap: one HTTP request, no browser.

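Step 2: Fallback to Puppeteer (for JS-heavy sites)

import puppeteer from "puppeteer";

// Render the page in headless Chrome and return the final HTML.
export async function fetchBrowser(url) {
  const browser = await puppeteer.launch({
    args: ["--no-sandbox"], // often needed in containers; weakens isolation
    headless: true // "new" was only accepted by some Puppeteer releases
  });

  try {
    const page = await browser.newPage();
    // Wait for network activity to settle so client-rendered content is present.
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.content();
  } finally {
    // Always close the browser, even if navigation throws.
    await browser.close();
  }
}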

👉 This handles sites like:

AliExpress
Amazon
modern React apps

Step 3: Smart Extraction

Example: product extractor

export function extractProduct($) {
  return {
    title: $("h1").first().text().trim(),
    price: $(".price, .product-price").first().text().trim(),
    // First <img> on the page; may be a logo rather than the product photo.
    image: $("img").first().attr("src")
  };
}
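
The email extractor mentioned earlier can follow the same pattern. Here is a minimal sketch, assuming a deliberately simple regex over the page text; `extractEmails` and `EMAIL_PATTERN` are hypothetical names, not the API's actual code.

```javascript
// Hypothetical helper: pull unique email addresses out of page text.
// The pattern is intentionally simple and will miss some valid addresses.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function extractEmails(text) {
  const matches = text.match(EMAIL_PATTERN) ?? [];
  // Lowercase and de-duplicate while keeping first-seen order.
  return [...new Set(matches.map((m) => m.toLowerCase()))];
}

console.log(
  extractEmails("Contact Sales@Example.com or sales@example.com, support@example.org.")
);
```

With Cheerio you could feed it `$("body").text()`. Addresses that only appear in `mailto:` links would need a separate pass over the `href` attributes.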
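
Step 4: Auto Fallback Logic

// Data is "weak" when a required field came back empty.
function isWeak(data) {
  return !data.title || !data.price;
}

// Wrapper (name is illustrative) so the snippet runs as a function.
export async function extractProductFromUrl(url) {
  // Fast path first.
  let result = extractProduct(await fetchStatic(url));

  // Retry in a real browser only if the fast path missed a field.
  if (isWeak(result)) {
    const html = await fetchBrowser(url);
    result = extractProduct(cheerio.load(html));
  }
  return result;
}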

👉 This is the secret sauce: the browser only runs when the fast path comes back empty.
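
One gap worth covering: Axios rejects on non-2xx responses by default, so a blocked request throws before the weak-data check ever runs. Here is a rough sketch of a fallback that treats a failed fast fetch like weak data. `extractWithFallback` and the injected `fast` / `browser` functions are hypothetical names, passed in so the logic can be tested without a network.

```javascript
// Sketch: run the fast path first, then fall back to the browser path when
// the fast path throws or returns weak data.
// `fast` and `browser` are injected async functions: (url) => data.
async function extractWithFallback(url, { fast, browser, isWeak }) {
  try {
    const data = await fast(url);
    if (!isWeak(data)) return { data, mode: "fast" };
  } catch (err) {
    // A blocked or failed static request (e.g. HTTP 403) is treated like weak data.
  }
  return { data: await browser(url), mode: "browser" };
}
```

Returning the `mode` alongside the data makes it easy to log how often the expensive browser path is actually used.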

🧪 Example API Request

curl -X POST https://your-api-url/api/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "type": "product"
  }'

📦 Example Response

{
  "title": "Wireless Headphones",
  "price": 19.99,
  "currency": "USD",
  "image": "https://..."
}
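
If you would rather call it from Node than curl, something like this should work. It is only a sketch: the host is the same placeholder as in the curl example, `buildExtractRequest` and `extractRemote` are hypothetical helper names, and any auth headers your deployment needs are not shown here.

```javascript
// Build the fetch() options for the extract endpoint shown in the curl example.
function buildExtractRequest(url, type) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, type }),
  };
}

// Node 18+ ships a global fetch. The host below is a placeholder.
async function extractRemote(url, type) {
  const res = await fetch(
    "https://your-api-url/api/v1/extract",
    buildExtractRequest(url, type)
  );
  if (!res.ok) throw new Error(`Extract failed with HTTP ${res.status}`);
  return res.json();
}
```

Usage would look like `const product = await extractRemote("https://example.com", "product");`.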

💡 Why This Is Powerful

Instead of writing scrapers like this:

document.querySelector(".price")

You just do:

POST /extract

And get clean JSON ✅
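
Behind that one endpoint, the server still has to check the request body before it fetches anything. Here is a minimal sketch of that validation. Only the `"product"` type appears in the example above; `"email"` and `"article"` are assumed names based on the feature list, and `validateExtractRequest` is a hypothetical helper.

```javascript
// Hypothetical validation for the JSON body of POST /extract.
// "email" and "article" are assumed type names; only "product" is shown in the post.
const SUPPORTED_TYPES = ["product", "email", "article"];

function validateExtractRequest(body) {
  if (!body || typeof body.url !== "string") {
    return { ok: false, error: "url is required" };
  }
  let parsed;
  try {
    parsed = new URL(body.url);
  } catch {
    return { ok: false, error: "url is not a valid URL" };
  }
  // Refuse file:, ftp:, and other schemes before anything is fetched.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return { ok: false, error: "only http and https URLs are supported" };
  }
  if (!SUPPORTED_TYPES.includes(body.type)) {
    return { ok: false, error: `type must be one of: ${SUPPORTED_TYPES.join(", ")}` };
  }
  return { ok: true, url: parsed.href, type: body.type };
}
```

Rejecting non-HTTP schemes up front matters here because the server fetches whatever URL the caller sends.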

🔥 Real Use Cases
SaaS builders
price tracking tools
lead generation systems
market research tools
AI pipelines
⚠️ Challenges I Faced
Dynamic rendering (fixed with Puppeteer)
Messy price formats (regex cleaning)
Anti-bot protection (still improving)
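
To give an idea of what the price cleaning involves, here is a simplified sketch. It assumes a tiny currency symbol table and only handles common shapes like "1,299.00" and "1.299,00"; `parsePrice` and `CURRENCY_SYMBOLS` are hypothetical names, not the API's real implementation.

```javascript
// Hypothetical symbol table; a real one would cover far more currencies.
const CURRENCY_SYMBOLS = { "$": "USD", "€": "EUR", "£": "GBP" };

function parsePrice(raw) {
  const text = String(raw ?? "");
  const symbol = Object.keys(CURRENCY_SYMBOLS).find((s) => text.includes(s));
  // Keep only the first run of digits and separators, e.g. "1,299.00".
  const match = text.match(/\d[\d.,]*/);
  if (!match) return null;

  let digits = match[0].replace(/[.,]+$/, ""); // drop trailing punctuation
  const lastComma = digits.lastIndexOf(",");
  const lastDot = digits.lastIndexOf(".");

  if (lastComma > lastDot && digits.length - lastComma - 1 === 2) {
    // Comma followed by exactly two digits: treat it as the decimal separator.
    digits =
      digits.slice(0, lastComma).replace(/[.,]/g, "") + "." + digits.slice(lastComma + 1);
  } else {
    // Otherwise commas are thousands separators.
    digits = digits.replace(/,/g, "");
  }

  const amount = Number(digits);
  if (!Number.isFinite(amount)) return null;
  return { price: amount, currency: symbol ? CURRENCY_SYMBOLS[symbol] : null };
}

console.log(parsePrice("US $1,299.00"));
console.log(parsePrice("1.299,00 €"));
```

Inputs it cannot make sense of, such as "Free" or dot-grouped thousands like "1.299.000", come back as `null` rather than a guessed number.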

🚀 What’s Next

I’m planning to add:

Proxy rotation
CAPTCHA bypass
AI extraction (LLM-based parsing)
Smart page classification

🙌 Final Thoughts

If you’ve ever built a scraper, you know:

It’s not fun 😅

This API makes it:

simple
fast
scalable

🔗 Try It + Feedback

👉 https://rapidapi.com/kushanherath59/api/ai-web-data-extractor-api

Would love your feedback 🙌
What feature should I add next?

Top comments (1)

Kushan Randika Herath

Built this in 2 days 🚀

Happy to share full repo if people are interested! github.com/Kushan20070126/ai-web-d...