Building scraping endpoints with a single Hono.js file
Tags: Web Scraping, GitHub Repository, Scraping, Backend, Hono.js
Hey there!
Welcome to a new blog post.
Web scraping has always been in demand and always will be. I've tried tons of APIs for it, both free and paid.
SERP API, Google APIs, and ScrapingBee all provide web scraping APIs, but in today's story we will build one of our own.
I'll also share the GitHub repository at the end, which you can use to easily launch your own scraping API on Vercel, Cloudflare, Netlify, or Fly.io, or even in a Docker container.
How does it start?
I was working on inkgest.com, and we needed a scraping API to scrape website content legitimately. The platforms above each have a few edge cases, and they cost a couple of dollars for just 100 requests per month, which isn't a feasible option.
Inkgest needs to make tons of scraping requests, and relying on a third party for that wasn't a good solution, so we were forced to build our own scraping API.
Why Honojs?
We have a couple of options for a simple open-source web scraping API framework: FastAPI or NestJS, among others. Anything works well, but I wanted something fast, small in size, and easy to deploy anywhere with Docker, so I went with Hono.js running on Bun: a newer tech stack, but a faster one.
What is Web Scraping?
For newcomers: web scraping, at its simplest, is an API that simulates a browser, opens a website URL, and extracts the content.
It can do more than that. Nearly every scraping setup drives the Chrome browser, headless or headful, the main exception being the LightPanda browser SDK, which works without headless Chrome and is much faster as a result.
A web scraping API is essentially an endpoint whose job is to create a browser instance, load the URL, read the DOM elements, and finally parse the content into a human-readable format.
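When the target page is static HTML and needs no JavaScript rendering, you don't even need a browser: a plain fetch plus a small parser already counts as scraping. A minimal sketch (extractTitle and scrapeStatic are illustrative names; a real project should use a proper HTML parser such as Cheerio instead of a regex):

```javascript
// Toy helper: pull the <title> out of raw HTML with a regex.
// Real projects should use a proper HTML parser (Cheerio, JSDOM).
function extractTitle(html) {
  const match = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  return match ? match[1].trim() : null;
}

// Node 18+ ships a global fetch, so no extra dependency is needed here.
async function scrapeStatic(url) {
  const res = await fetch(url);
  const html = await res.text();
  return { url, title: extractTitle(html) };
}
```

For JavaScript-heavy pages this falls apart, which is exactly why the rest of the post reaches for a real browser.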
Packages for Web Scraping
A few open-source packages and SDKs commonly used for web scraping are listed below:
- Puppeteer
- Playwright
- Cheerio
- Crawlee
- Lightpanda browser
There are plenty more options; a good Google search will surface them.
How does a Web Scraping API work?
- Launch a browser instance
- Navigate to the URL
- Wait for the page and its elements to load, then read the DOM
- Parse the DOM elements and extract the content
Why Is Scraping Hard (or Illegal)?
Web scraping itself is not illegal, but websites have the authority to block scrapers, and many do, including Reddit, X, Substack, and Medium.
Most websites don't allow scraping everywhere: each site serves a robots.txt file with instructions for crawlers, and every well-behaved scraper should follow them.
We can't simply scrape every page, all the time, from any site; we'll hit rate limits, trigger bot detectors, or get our IP blocked.
That's why web scraping isn't easy; doing it well is an art that involves the following:
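As a sketch of what honouring robots.txt can look like, here is a toy checker that handles only the `User-agent: *` group's `Disallow` rules. A real crawler should use a dedicated parser, since this ignores `Allow` rules, wildcards, and per-agent groups:

```javascript
// Toy robots.txt check: collect Disallow rules from the "User-agent: *"
// group and test a path by prefix. Deliberately incomplete — see note above.
function isPathAllowed(robotsTxt, path) {
  const disallowed = [];
  let inStarGroup = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    if (/^user-agent:/i.test(line)) {
      inStarGroup = line.slice(11).trim() === "*";
    } else if (inStarGroup && /^disallow:/i.test(line)) {
      const rule = line.slice(9).trim();
      if (rule) disallowed.push(rule); // empty Disallow means "allow all"
    }
  }
  return !disallowed.some((rule) => path.startsWith(rule));
}
```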
- Rotating IP addresses
- Rotating the user agents our scraping API sets on each browser instance
- Handling cookies, overlays, and modals on the website before scraping
There's a lot to do when scraping a website's content, and handling those API endpoints yourself won't cost you much money, but it will cost you time.
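The user-agent rotation mentioned above can be as simple as cycling through a small pool of UA strings. The strings and the nextUserAgent helper below are illustrative, and should be kept current in a real deployment:

```javascript
// Round-robin user-agent rotation: each new page gets the next UA from a
// small pool, so repeated requests look less uniform.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
];

let uaIndex = 0;
function nextUserAgent() {
  const ua = USER_AGENTS[uaIndex % USER_AGENTS.length];
  uaIndex++;
  return ua;
}

// Hypothetical placement inside a page-setup routine:
// await page.setUserAgent(nextUserAgent());
```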
Simple Scraping API
Below is the code for a simple web scraping API
/**
* Minimal Hono API: POST /scrape + POST /screenshot
* Pattern aligned with ihatereading-api: BrowserPool → puppeteer-core →
* request interception, stealth-ish evaluateOnNewDocument, JSDOM cleanup, markdown.
*
* Env:
* BROWSER_POOL_SIZE=2
* CHROME_PATH=/path/to/chrome (local dev; optional if @sparticuz/chromium works)
*/
import { serve } from "@hono/node-server";
import { Hono } from "hono";
import { JSDOM } from "jsdom";
import TurndownService from "turndown";
const POOL_SIZE = Math.max(1, parseInt(process.env.BROWSER_POOL_SIZE || "2", 10) || 2);
const CHROME_ARGS = [
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-dev-shm-usage",
"--disable-gpu",
"--disable-web-security",
"--no-zygote",
"--single-process",
];
function isValidHttpUrl(s) {
try {
const u = new URL(String(s).trim());
return u.protocol === "http:" || u.protocol === "https:";
} catch {
return false;
}
}
/** In-memory sliding-window rate limit (same idea as index.js). */
const rateLimitMap = new Map();
function rateLimit(ip, limit, windowMs) {
const now = Date.now();
const record = rateLimitMap.get(ip);
if (!record || now > record.resetTime) {
rateLimitMap.set(ip, { count: 1, resetTime: now + windowMs });
return { allowed: true, remaining: limit - 1 };
}
if (record.count >= limit) {
return {
allowed: false,
retryAfter: Math.ceil((record.resetTime - now) / 1000),
remaining: 0,
};
}
record.count++;
return { allowed: true, remaining: limit - record.count };
}
setInterval(() => {
const now = Date.now();
for (const [ip, r] of rateLimitMap.entries()) {
if (now > r.resetTime) rateLimitMap.delete(ip);
}
}, 5 * 60 * 1000);
class BrowserPool {
constructor(size) {
this._size = size;
this._pool = [];
this._queue = [];
this._ready = false;
this._loading = false;
}
async _launch() {
const puppeteer = (await import("puppeteer-core")).default;
const chromium = (await import("@sparticuz/chromium")).default;
const chromePath = process.env.CHROME_PATH?.trim();
if (chromePath) {
return puppeteer.launch({
headless: true,
executablePath: chromePath,
args: CHROME_ARGS,
});
}
try {
const executablePath = await chromium.executablePath();
return puppeteer.launch({
headless: true,
executablePath,
args: [...chromium.args, "--disable-web-security"],
ignoreDefaultArgs: ["--disable-extensions"],
});
} catch {
throw new Error(
"Could not launch Chrome. Set CHROME_PATH to your Chrome/Chromium binary.",
);
}
}
async initialise() {
if (this._ready || this._loading) return;
this._loading = true;
const browsers = await Promise.all(
Array.from({ length: this._size }, () => this._launch()),
);
this._pool = browsers.map((browser, index) => ({
browser,
busy: false,
index,
}));
this._ready = true;
this._loading = false;
}
_acquire() {
const free = this._pool.find((e) => !e.busy);
if (free) {
free.busy = true;
return Promise.resolve(free);
}
return new Promise((resolve) => this._queue.push(resolve));
}
_release(entry) {
entry.busy = false;
const next = this._queue.shift();
if (next) {
const f = this._pool.find((e) => !e.busy);
if (f) {
f.busy = true;
next(f);
} else this._queue.unshift(next);
}
}
async withPage(fn) {
if (!this._ready) await this.initialise();
const entry = await this._acquire();
let page;
try {
page = await entry.browser.newPage();
return await fn(page);
} finally {
if (page) {
try {
await page.close();
} catch {}
}
this._release(entry);
}
}
get stats() {
if (!this._pool.length) {
return { size: 0, busy: 0, free: 0, queued: this._queue.length };
}
const busy = this._pool.filter((e) => e.busy).length;
return {
size: this._pool.length,
busy,
free: this._pool.length - busy,
queued: this._queue.length,
};
}
}
const browserPool = new BrowserPool(POOL_SIZE);
const turndown = new TurndownService({
headingStyle: "atx",
codeBlockStyle: "fenced",
});
const NOISE_SELECTORS = [
"script",
"style",
"noscript",
"header",
"footer",
"nav",
"aside",
".ad",
".ads",
"[class*='cookie' i]",
"[id*='cookie' i]",
].join(", ");
function stripDomNoise(document) {
document.querySelectorAll(NOISE_SELECTORS).forEach((el) => el.remove());
}
async function setupPage(page, { timeout, viewport }) {
await page.setViewport(viewport);
await page.setUserAgent(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
);
await page.setExtraHTTPHeaders({
Accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
});
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, "webdriver", { get: () => undefined });
Object.defineProperty(navigator, "plugins", { get: () => [1, 2, 3, 4, 5] });
Object.defineProperty(navigator, "languages", { get: () => ["en-US", "en"] });
});
await page.setRequestInterception(true);
page.on("request", (req) => {
const type = req.resourceType();
const url = req.url().toLowerCase();
if (type === "image" || type === "font" || type === "media") {
req.abort();
return;
}
if (type === "stylesheet") {
req.respond({ status: 200, contentType: "text/css", body: "" });
return;
}
if (
url.includes("cloudflare") ||
url.includes("challenge") ||
url.includes("bot-detection")
) {
req.abort();
return;
}
req.continue();
});
page.setDefaultNavigationTimeout(timeout);
page.setDefaultTimeout(timeout);
}
/**
* Core scrape — simplified from scrapeSingleUrlWithPuppeteer (no Firestore, proxy, Reddit, G2, AI).
*/
async function scrapeUrl(url, options = {}) {
const {
waitForSelector = null,
timeout = 30_000,
includeSemanticContent = true,
includeImages = true,
includeLinks = true,
extractMetadata = true,
} = options;
const viewport = { width: 1366, height: 768, deviceScaleFactor: 1 };
return browserPool.withPage(async (page) => {
await setupPage(page, { timeout, viewport });
await page.goto(url, { waitUntil: "domcontentloaded", timeout });
if (waitForSelector) {
try {
await page.waitForSelector(waitForSelector, { timeout: 10_000 });
} catch {}
}
const scrapedData = includeSemanticContent
? await page.evaluate(
(opts) => {
const data = {
url: location.href,
title: document.title,
content: {},
metadata: {},
links: [],
images: [],
};
["h1", "h2", "h3", "h4", "h5", "h6"].forEach((tag) => {
data.content[tag] = Array.from(
document.querySelectorAll(tag),
).map((h) => h.textContent.trim());
});
if (opts.extractMetadata) {
document.querySelectorAll("meta").forEach((meta) => {
const name =
meta.getAttribute("name") || meta.getAttribute("property");
const content = meta.getAttribute("content");
if (name && content) data.metadata[name] = content;
});
}
if (opts.includeLinks) {
const host = location.hostname;
const seen = new Set();
data.links = Array.from(document.querySelectorAll("a[href]"))
.map((a) => ({
text: a.textContent.trim(),
href: a.href,
title: a.getAttribute("title") || "",
}))
.filter((l) => {
try {
if (new URL(l.href).hostname !== host) return false;
} catch {
return false;
}
if (!l.text && !l.title) return false;
const k = `${l.text}|${l.href}`;
if (seen.has(k)) return false;
seen.add(k);
return true;
});
}
if (opts.includeImages) {
data.images = Array.from(document.querySelectorAll("img[src]"))
.filter((img) => !img.src.startsWith("data:"))
.map((img) => ({
src: img.src,
alt: img.alt || "",
}));
}
return data;
},
{ extractMetadata, includeLinks, includeImages },
)
: { url, title: await page.title(), content: {}, metadata: {}, links: [], images: [] };
const html = await page.content();
const dom = new JSDOM(html);
const doc = dom.window.document;
stripDomNoise(doc);
const markdown = turndown.turndown(doc.body);
return {
success: true,
data: scrapedData,
markdown,
};
});
}
async function screenshotUrl(url, options = {}) {
const {
timeout = 30_000,
fullPage = true,
waitForSelector = null,
} = options;
const viewport = { width: 1366, height: 768, deviceScaleFactor: 1 };
return browserPool.withPage(async (page) => {
await setupPage(page, { timeout, viewport });
await page.goto(url, { waitUntil: "domcontentloaded", timeout });
if (waitForSelector) {
try {
await page.waitForSelector(waitForSelector, { timeout: 10_000 });
} catch {}
}
const buf = await page.screenshot({
type: "png",
fullPage,
encoding: "binary",
});
return Buffer.from(buf).toString("base64");
});
}
const app = new Hono();
app.get("/", (c) =>
c.json({
ok: true,
hint: "POST /scrape or /screenshot with JSON { url, ... }",
}),
);
app.post("/scrape", async (c) => {
const ip =
c.req.header("x-forwarded-for")?.split(",")[0]?.trim() ||
c.req.header("x-real-ip") ||
"unknown";
const rl = rateLimit(ip, 30, 10 * 60 * 1000);
if (!rl.allowed) {
c.header("Retry-After", String(rl.retryAfter));
return c.json({ error: "rate_limited", retryAfter: rl.retryAfter }, 429);
}
let body = {};
try {
body = await c.req.json();
} catch {
return c.json({ error: "invalid_json" }, 400);
}
const { url } = body;
if (!url || !isValidHttpUrl(url)) {
return c.json({ error: "valid http(s) url required" }, 400);
}
try {
const result = await scrapeUrl(url, {
waitForSelector: body.waitForSelector ?? null,
timeout: Number(body.timeout) || 30_000,
includeSemanticContent: body.includeSemanticContent !== false,
includeImages: body.includeImages !== false,
includeLinks: body.includeLinks !== false,
extractMetadata: body.extractMetadata !== false,
});
return c.json({
...result,
url,
timestamp: new Date().toISOString(),
poolStats: browserPool.stats,
});
} catch (e) {
console.error(e);
return c.json(
{ success: false, error: e?.message || "scrape_failed", url },
500,
);
}
});
app.post("/screenshot", async (c) => {
const ip =
c.req.header("x-forwarded-for")?.split(",")[0]?.trim() ||
c.req.header("x-real-ip") ||
"unknown";
const rl = rateLimit(ip, 20, 10 * 60 * 1000);
if (!rl.allowed) {
c.header("Retry-After", String(rl.retryAfter));
return c.json({ error: "rate_limited" }, 429);
}
let body = {};
try {
body = await c.req.json();
} catch {
return c.json({ error: "invalid_json" }, 400);
}
const { url } = body;
if (!url || !isValidHttpUrl(url)) {
return c.json({ error: "valid http(s) url required" }, 400);
}
try {
const b64 = await screenshotUrl(url, {
timeout: Number(body.timeout) || 30_000,
fullPage: body.fullPage !== false,
waitForSelector: body.waitForSelector ?? null,
});
return c.json({
success: true,
url,
image_base64: b64,
mime: "image/png",
timestamp: new Date().toISOString(),
poolStats: browserPool.stats,
});
} catch (e) {
console.error(e);
return c.json({ success: false, error: e?.message || "screenshot_failed" }, 500);
}
});
const port = Number(process.env.PORT) || 3000;
console.log(`Listening on http://localhost:${port}`);
serve({ fetch: app.fetch, port });
A few things to note:
- We use @sparticuz/chromium, a slimmed-down Chromium build made for serverless platforms, which is why the API also runs on Vercel
- To run this on Cloudflare Workers, you would use Cloudflare's Browser Rendering with its own Puppeteer bindings instead
- The turndown npm package converts the scraped HTML into markdown
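As a toy illustration of what that markdown conversion involves (Turndown itself walks the real DOM and handles nesting and escaping properly; this regex version only shows the idea):

```javascript
// Toy HTML → markdown mapping for a handful of tags. Only illustrative —
// use Turndown for anything real, since regexes can't handle nested HTML.
function toyHtmlToMarkdown(html) {
  return html
    .replace(/<h1[^>]*>(.*?)<\/h1>/gi, "# $1\n")
    .replace(/<strong[^>]*>(.*?)<\/strong>/gi, "**$1**")
    .replace(/<em[^>]*>(.*?)<\/em>/gi, "*$1*")
    .replace(/<p[^>]*>(.*?)<\/p>/gi, "$1\n")
    .trim();
}
```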
With one simple method exposed as an endpoint, our web scraping API is ready.
Deploying in Production
Let's be honest: this API works for personal use cases, but not in production for a mass audience. So our next task is to make it production-ready:
- Create a pool of browser instances (BrowserPool), which handles scraping multiple URLs concurrently
- Add rate limiting to the API endpoints
- Add a URL validation layer
- Cache already-scraped URL content, so repeat requests skip the browser entirely and save cost and time
- Strip noise selectors while scraping, which keeps the API from loading unwanted scripts such as Google Analytics or ads
- Rotate IPs and user agents (via proxies) to avoid getting banned when scraping at volume
Most of these points (the browser pool, rate limiting, URL validation, and noise stripping) are already wired into the code above; caching and IP rotation are the remaining steps toward a truly production-grade scraping API.
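The caching point can be sketched as a small in-memory TTL cache keyed by URL. cacheGet and cacheSet are illustrative helpers, and a production setup would more likely use Redis so the cache survives restarts and is shared across instances:

```javascript
// In-memory TTL cache for scraped results: repeat requests for the same
// URL within the window skip the browser entirely.
const scrapeCache = new Map();

function cacheGet(url, ttlMs) {
  const hit = scrapeCache.get(url);
  if (!hit) return null;
  if (Date.now() - hit.at > ttlMs) {
    scrapeCache.delete(url); // expired
    return null;
  }
  return hit.value;
}

function cacheSet(url, value) {
  scrapeCache.set(url, { value, at: Date.now() });
}

// Hypothetical placement inside the /scrape handler:
// const cached = cacheGet(url, 10 * 60 * 1000);
// if (cached) return c.json({ ...cached, cached: true });
```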
Take Screenshot & AI Summary
The same service can also capture screenshots, exposed here as the POST /screenshot endpoint.
For a screenshot, we load the webpage, wait for it to settle, and then call Puppeteer's page.screenshot method. Quite an easy one.
Another thing worth mentioning is aiSummary: by passing the scraped content to an LLM, we can generate an AI summary for every webpage. This is quite interesting and useful.
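One practical detail in that aiSummary step is keeping the prompt within the model's context budget. Below is a sketch under that assumption: clampForPrompt is an illustrative helper, and the actual LLM call is left as a caller-supplied function since it is provider-specific:

```javascript
// Trim scraped markdown to a rough character budget before prompting an
// LLM, cutting at a paragraph boundary where possible so the model gets
// clean text. The 12k default is an arbitrary illustrative budget.
function clampForPrompt(markdown, maxChars = 12_000) {
  if (markdown.length <= maxChars) return markdown;
  const cut = markdown.slice(0, maxChars);
  const lastBreak = cut.lastIndexOf("\n\n");
  // Only back up to the paragraph break if it doesn't discard too much
  return lastBreak > maxChars / 2 ? cut.slice(0, lastBreak) : cut;
}

// `summarise` is whatever LLM client you use (OpenAI, Claude, a local
// model) — any function that takes a prompt string and returns text.
async function aiSummary(markdown, summarise) {
  const prompt = `Summarise this webpage content:\n\n${clampForPrompt(markdown)}`;
  return summarise(prompt);
}
```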
That would be enough for today. If you need anything else, feel free to email me.
Cheers
Shrey