Building scraping endpoints with a single Hono.js file
Tags: Web Scraping, GitHub Repository, Scraping, Backend, Hono.js
Hey there!
Welcome to a new blog post.
Web scraping has always been in demand and always will be. I've tried tons of APIs for it, both free and paid.
SERP API, Google APIs, and ScrapingBee all provide web scraping APIs, but in today's story we will build one of our own.
I'll also share the GitHub repository at the end, which you can use to easily launch your own scraping API on Vercel, Cloudflare, Netlify, or Fly.io, or even in a Docker container.
How does it start?
I was working on inkgest.com, and we needed a scraping API to scrape website content legitimately. The platforms above each have a few edge cases, and they cost a couple of dollars for just 100 requests per month, which isn't a feasible option.
Inkgest needs to make tons of scraping requests, and relying on a third party for that wasn't a good solution, so we were forced to build our own scraping API.
Why Honojs?
We have a couple of options for a simple open-source web scraping API framework: FastAPI or NestJS, among others. Anything works well, but I wanted something fast, small in size, and easy to deploy anywhere with Docker, so I went with Hono.js running on Bun: a newer tech stack, but a faster one.
What is Web Scraping?
For newcomers: web scraping, at its simplest, is an API that simulates a browser, opens a website URL, and extracts the content.
It can do more than that. Nearly every scraping setup drives the Chrome browser, headless or headful, the main exception being the LightPanda browser SDK, which works without headless Chrome and is much faster as a result.
A web scraping API is essentially an endpoint whose job is to create a browser instance, load the URL, read the DOM elements, and finally parse the content into a human-readable format.
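When the target page is static HTML and needs no JavaScript rendering, you don't even need a browser: a plain fetch plus a small parser already counts as scraping. A minimal sketch (extractTitle and scrapeStatic are illustrative names; a real project should use a proper HTML parser such as Cheerio instead of a regex):

```javascript
// Toy helper: pull the <title> out of raw HTML with a regex.
// Real projects should use a proper HTML parser (Cheerio, JSDOM).
function extractTitle(html) {
  const match = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  return match ? match[1].trim() : null;
}

// Node 18+ ships a global fetch, so no extra dependency is needed here.
async function scrapeStatic(url) {
  const res = await fetch(url);
  const html = await res.text();
  return { url, title: extractTitle(html) };
}
```

For JavaScript-heavy pages this falls apart, which is exactly why the rest of the post reaches for a real browser.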
Packages for Web Scraping
A few open-source packages and SDKs commonly used for web scraping are listed below:
- Puppeteer
- Playwright
- Cheerio
- Crawlee
- Lightpanda browser
There are plenty more options; a good Google search will surface them.
How does a Web Scraping API work?
- Launch a browser instance
- Navigate to the URL
- Wait for the page and its elements to load, then read the DOM
- Parse the DOM elements and extract the content
Why Is Scraping Hard (or Illegal)?
Web scraping itself is not illegal, but websites have the authority to block scrapers, and many do, including Reddit, X, Substack, and Medium.
Most websites don't allow scraping everywhere: each site serves a robots.txt file with instructions for crawlers, and every well-behaved scraper should follow them.
We can't simply scrape every page, all the time, from any site; we'll hit rate limits, trigger bot detectors, or get our IP blocked.
That's why web scraping isn't easy; doing it well is an art that involves the following:
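As a sketch of what honouring robots.txt can look like, here is a toy checker that handles only the `User-agent: *` group's `Disallow` rules. A real crawler should use a dedicated parser, since this ignores `Allow` rules, wildcards, and per-agent groups:

```javascript
// Toy robots.txt check: collect Disallow rules from the "User-agent: *"
// group and test a path by prefix. Deliberately incomplete — see note above.
function isPathAllowed(robotsTxt, path) {
  const disallowed = [];
  let inStarGroup = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    if (/^user-agent:/i.test(line)) {
      inStarGroup = line.slice(11).trim() === "*";
    } else if (inStarGroup && /^disallow:/i.test(line)) {
      const rule = line.slice(9).trim();
      if (rule) disallowed.push(rule); // empty Disallow means "allow all"
    }
  }
  return !disallowed.some((rule) => path.startsWith(rule));
}
```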
- Rotating IP addresses
- Rotating the user agents our scraping API sets on each browser instance
- Handling cookies, overlays, and modals on the website before scraping
There's a lot to do when scraping a website's content, and handling those API endpoints yourself won't cost you much money, but it will cost you time.
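The user-agent rotation mentioned above can be as simple as cycling through a small pool of UA strings. The strings and the nextUserAgent helper below are illustrative, and should be kept current in a real deployment:

```javascript
// Round-robin user-agent rotation: each new page gets the next UA from a
// small pool, so repeated requests look less uniform.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
];

let uaIndex = 0;
function nextUserAgent() {
  const ua = USER_AGENTS[uaIndex % USER_AGENTS.length];
  uaIndex++;
  return ua;
}

// Hypothetical placement inside a page-setup routine:
// await page.setUserAgent(nextUserAgent());
```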
Simple Scraping API
Below is the code for a simple web scraping API
/**
* Minimal Hono API: POST /scrape + POST /screenshot
* Pattern aligned with ihatereading-api: BrowserPool → puppeteer-core →
* request interception, stealth-ish evaluateOnNewDocument, JSDOM cleanup, markdown.
*
* Env:
* BROWSER_POOL_SIZE=2
* CHROME_PATH=/path/to/chrome (local dev; optional if @sparticuz/chromium works)
*/
import { serve } from "@hono/node-server";
import { Hono } from "hono";
import { JSDOM } from "jsdom";
import TurndownService from "turndown";
const POOL_SIZE = Math.max(1, parseInt(process.env.BROWSER_POOL_SIZE || "2", 10) || 2);
const CHROME_ARGS = [
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-dev-shm-usage",
"--disable-gpu",
"--disable-web-security",
"--no-zygote",
"--single-process",
];
function isValidHttpUrl(s) {
try {
const u = new URL(String(s).trim());
return u.protocol === "http:" || u.protocol === "https:";
} catch {
return false;
}
}
/** In-memory sliding-window rate limit (same idea as index.js). */
const rateLimitMap = new Map();
function rateLimit(ip, limit, windowMs) {
const now = Date.now();
const record = rateLimitMap.get(ip);
if (!record || now > record.resetTime) {
rateLimitMap.set(ip, { count: 1, resetTime: now + windowMs });
return { allowed: true, remaining: limit - 1 };
}
if (record.count >= limit) {
return {
allowed: false,
retryAfter: Math.ceil((record.resetTime - now) / 1000),
remaining: 0,
};
}
record.count++;
return { allowed: true, remaining: limit - record.count };
}
setInterval(() => {
const now = Date.now();
for (const [ip, r] of rateLimitMap.entries()) {
if (now > r.resetTime) rateLimitMap.delete(ip);
}
}, 5 * 60 * 1000);
class BrowserPool {
constructor(size) {
this._size = size;
this._pool = [];
this._queue = [];
this._ready = false;
this._loading = false;
}
async _launch() {
const puppeteer = (await import("puppeteer-core")).default;
const chromium = (await import("@sparticuz/chromium")).default;
const chromePath = process.env.CHROME_PATH?.trim();
if (chromePath) {
return puppeteer.launch({
headless: true,
executablePath: chromePath,
args: CHROME_ARGS,
});
}
try {
const executablePath = await chromium.executablePath();
return puppeteer.launch({
headless: true,
executablePath,
args: [...chromium.args, "--disable-web-security"],
ignoreDefaultArgs: ["--disable-extensions"],
});
} catch {
throw new Error(
"Could not launch Chrome. Set CHROME_PATH to your Chrome/Chromium binary.",
);
}
}
async initialise() {
if (this._ready || this._loading) return;
this._loading = true;
const browsers = await Promise.all(
Array.from({ length: this._size }, () => this._launch()),
);
this._pool = browsers.map((browser, index) => ({
browser,
busy: false,
index,
}));
this._ready = true;
this._loading = false;
}
_acquire() {
const free = this._pool.find((e) => !e.busy);
if (free) {
free.busy = true;
return Promise.resolve(free);
}
return new Promise((resolve) => this._queue.push(resolve));
}
_release(entry) {
entry.busy = false;
const next = this._queue.shift();
if (next) {
const f = this._pool.find((e) => !e.busy);
if (f) {
f.busy = true;
next(f);
} else this._queue.unshift(next);
}
}
async withPage(fn) {
if (!this._ready) await this.initialise();
const entry = await this._acquire();
let page;
try {
page = await entry.browser.newPage();
return await fn(page);
} finally {
if (page) {
try {
await page.close();
} catch {}
}
this._release(entry);
}
}
get stats() {
if (!this._pool.length) {
return { size: 0, busy: 0, free: 0, queued: this._queue.length };
}
const busy = this._pool.filter((e) => e.busy).length;
return {
size: this._pool.length,
busy,
free: this._pool.length - busy,
queued: this._queue.length,
};
}
}
const browserPool = new BrowserPool(POOL_SIZE);
const turndown = new TurndownService({
headingStyle: "atx",
codeBlockStyle: "fenced",
});
const NOISE_SELECTORS = [
"script",
"style",
"noscript",
"header",
"footer",
"nav",
"aside",
".ad",
".ads",
"[class*='cookie' i]",
"[id*='cookie' i]",
].join(", ");
function stripDomNoise(document) {
document.querySelectorAll(NOISE_SELECTORS).forEach((el) => el.remove());
}
async function setupPage(page, { timeout, viewport }) {
await page.setViewport(viewport);
await page.setUserAgent(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
);
await page.setExtraHTTPHeaders({
Accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
});
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, "webdriver", { get: () => undefined });
Object.defineProperty(navigator, "plugins", { get: () => [1, 2, 3, 4, 5] });
Object.defineProperty(navigator, "languages", { get: () => ["en-US", "en"] });
});
await page.setRequestInterception(true);
page.on("request", (req) => {
const type = req.resourceType();
const url = req.url().toLowerCase();
if (type === "image" || type === "font" || type === "media") {
req.abort();
return;
}
if (type === "stylesheet") {
req.respond({ status: 200, contentType: "text/css", body: "" });
return;
}
if (
url.includes("cloudflare") ||
url.includes("challenge") ||
url.includes("bot-detection")
) {
req.abort();
return;
}
req.continue();
});
page.setDefaultNavigationTimeout(timeout);
page.setDefaultTimeout(timeout);
}
/**
* Core scrape — simplified from scrapeSingleUrlWithPuppeteer (no Firestore, proxy, Reddit, G2, AI).
*/
async function scrapeUrl(url, options = {}) {
const {
waitForSelector = null,
timeout = 30_000,
includeSemanticContent = true,
includeImages = true,
includeLinks = true,
extractMetadata = true,
} = options;
const viewport = { width: 1366, height: 768, deviceScaleFactor: 1 };
return browserPool.withPage(async (page) => {
await setupPage(page, { timeout, viewport });
await page.goto(url, { waitUntil: "domcontentloaded", timeout });
if (waitForSelector) {
try {
await page.waitForSelector(waitForSelector, { timeout: 10_000 });
} catch {}
}
const scrapedData = includeSemanticContent
? await page.evaluate(
(opts) => {
const data = {
url: location.href,
title: document.title,
content: {},
metadata: {},
links: [],
images: [],
};
["h1", "h2", "h3", "h4", "h5", "h6"].forEach((tag) => {
data.content[tag] = Array.from(
document.querySelectorAll(tag),
).map((h) => h.textContent.trim());
});
if (opts.extractMetadata) {
document.querySelectorAll("meta").forEach((meta) => {
const name =
meta.getAttribute("name") || meta.getAttribute("property");
const content = meta.getAttribute("content");
if (name && content) data.metadata[name] = content;
});
}
if (opts.includeLinks) {
const host = location.hostname;
const seen = new Set();
data.links = Array.from(document.querySelectorAll("a[href]"))
.map((a) => ({
text: a.textContent.trim(),
href: a.href,
title: a.getAttribute("title") || "",
}))
.filter((l) => {
try {
if (new URL(l.href).hostname !== host) return false;
} catch {
return false;
}
if (!l.text && !l.title) return false;
const k = `${l.text}|${l.href}`;
if (seen.has(k)) return false;
seen.add(k);
return true;
});
}
if (opts.includeImages) {
data.images = Array.from(document.querySelectorAll("img[src]"))
.filter((img) => !img.src.startsWith("data:"))
.map((img) => ({
src: img.src,
alt: img.alt || "",
}));
}
return data;
},
{ extractMetadata, includeLinks, includeImages },
)
: { url, title: await page.title(), content: {}, metadata: {}, links: [], images: [] };
const html = await page.content();
const dom = new JSDOM(html);
const doc = dom.window.document;
stripDomNoise(doc);
const markdown = turndown.turndown(doc.body);
return {
success: true,
data: scrapedData,
markdown,
};
});
}
async function screenshotUrl(url, options = {}) {
const {
timeout = 30_000,
fullPage = true,
waitForSelector = null,
} = options;
const viewport = { width: 1366, height: 768, deviceScaleFactor: 1 };
return browserPool.withPage(async (page) => {
await setupPage(page, { timeout, viewport });
await page.goto(url, { waitUntil: "domcontentloaded", timeout });
if (waitForSelector) {
try {
await page.waitForSelector(waitForSelector, { timeout: 10_000 });
} catch {}
}
const buf = await page.screenshot({
type: "png",
fullPage,
encoding: "binary",
});
return Buffer.from(buf).toString("base64");
});
}
const app = new Hono();
app.get("/", (c) =>
c.json({
ok: true,
hint: "POST /scrape or /screenshot with JSON { url, ... }",
}),
);
app.post("/scrape", async (c) => {
const ip =
c.req.header("x-forwarded-for")?.split(",")[0]?.trim() ||
c.req.header("x-real-ip") ||
"unknown";
const rl = rateLimit(ip, 30, 10 * 60 * 1000);
if (!rl.allowed) {
c.header("Retry-After", String(rl.retryAfter));
return c.json({ error: "rate_limited", retryAfter: rl.retryAfter }, 429);
}
let body = {};
try {
body = await c.req.json();
} catch {
return c.json({ error: "invalid_json" }, 400);
}
const { url } = body;
if (!url || !isValidHttpUrl(url)) {
return c.json({ error: "valid http(s) url required" }, 400);
}
try {
const result = await scrapeUrl(url, {
waitForSelector: body.waitForSelector ?? null,
timeout: Number(body.timeout) || 30_000,
includeSemanticContent: body.includeSemanticContent !== false,
includeImages: body.includeImages !== false,
includeLinks: body.includeLinks !== false,
extractMetadata: body.extractMetadata !== false,
});
return c.json({
...result,
url,
timestamp: new Date().toISOString(),
poolStats: browserPool.stats,
});
} catch (e) {
console.error(e);
return c.json(
{ success: false, error: e?.message || "scrape_failed", url },
500,
);
}
});
app.post("/screenshot", async (c) => {
const ip =
c.req.header("x-forwarded-for")?.split(",")[0]?.trim() ||
c.req.header("x-real-ip") ||
"unknown";
const rl = rateLimit(ip, 20, 10 * 60 * 1000);
if (!rl.allowed) {
c.header("Retry-After", String(rl.retryAfter));
return c.json({ error: "rate_limited" }, 429);
}
let body = {};
try {
body = await c.req.json();
} catch {
return c.json({ error: "invalid_json" }, 400);
}
const { url } = body;
if (!url || !isValidHttpUrl(url)) {
return c.json({ error: "valid http(s) url required" }, 400);
}
try {
const b64 = await screenshotUrl(url, {
timeout: Number(body.timeout) || 30_000,
fullPage: body.fullPage !== false,
waitForSelector: body.waitForSelector ?? null,
});
return c.json({
success: true,
url,
image_base64: b64,
mime: "image/png",
timestamp: new Date().toISOString(),
poolStats: browserPool.stats,
});
} catch (e) {
console.error(e);
return c.json({ success: false, error: e?.message || "screenshot_failed" }, 500);
}
});
const port = Number(process.env.PORT) || 3000;
console.log(`Listening on http://localhost:${port}`);
serve({ fetch: app.fetch, port });
A few things to note:
- We use @sparticuz/chromium, a slimmed-down Chromium build made for serverless platforms, which is why the API also runs on Vercel
- To run this on Cloudflare Workers, you would use Cloudflare's Browser Rendering with its own Puppeteer bindings instead
- The turndown npm package converts the scraped HTML into markdown
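As a toy illustration of what that markdown conversion involves (Turndown itself walks the real DOM and handles nesting and escaping properly; this regex version only shows the idea):

```javascript
// Toy HTML → markdown mapping for a handful of tags. Only illustrative —
// use Turndown for anything real, since regexes can't handle nested HTML.
function toyHtmlToMarkdown(html) {
  return html
    .replace(/<h1[^>]*>(.*?)<\/h1>/gi, "# $1\n")
    .replace(/<strong[^>]*>(.*?)<\/strong>/gi, "**$1**")
    .replace(/<em[^>]*>(.*?)<\/em>/gi, "*$1*")
    .replace(/<p[^>]*>(.*?)<\/p>/gi, "$1\n")
    .trim();
}
```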
With one simple method exposed as an endpoint, our web scraping API is ready.
Deploying in Production
Let's be honest: this API works for personal use cases, but not in production for a mass audience. So our next task is to make it production-ready:
- Create a pool of browser instances (BrowserPool), which handles scraping multiple URLs concurrently
- Add rate limiting to the API endpoints
- Add a URL validation layer
- Cache already-scraped URL content, so repeat requests skip the browser entirely and save cost and time
- Strip noise selectors while scraping, which keeps the API from loading unwanted scripts such as Google Analytics or ads
- Rotate IPs and user agents (via proxies) to avoid getting banned when scraping at volume
Most of these points (the browser pool, rate limiting, URL validation, and noise stripping) are already wired into the code above; caching and IP rotation are the remaining steps toward a truly production-grade scraping API.
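The caching point can be sketched as a small in-memory TTL cache keyed by URL. cacheGet and cacheSet are illustrative helpers, and a production setup would more likely use Redis so the cache survives restarts and is shared across instances:

```javascript
// In-memory TTL cache for scraped results: repeat requests for the same
// URL within the window skip the browser entirely.
const scrapeCache = new Map();

function cacheGet(url, ttlMs) {
  const hit = scrapeCache.get(url);
  if (!hit) return null;
  if (Date.now() - hit.at > ttlMs) {
    scrapeCache.delete(url); // expired
    return null;
  }
  return hit.value;
}

function cacheSet(url, value) {
  scrapeCache.set(url, { value, at: Date.now() });
}

// Hypothetical placement inside the /scrape handler:
// const cached = cacheGet(url, 10 * 60 * 1000);
// if (cached) return c.json({ ...cached, cached: true });
```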
Take Screenshot & AI Summary
The same service can also capture screenshots, exposed here as the POST /screenshot endpoint.
For a screenshot, we load the webpage, wait for it to settle, and then call Puppeteer's page.screenshot method. Quite an easy one.
Another thing worth mentioning is aiSummary: by passing the scraped content to an LLM, we can generate an AI summary for every webpage. This is quite interesting and useful.
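One practical detail in that aiSummary step is keeping the prompt within the model's context budget. Below is a sketch under that assumption: clampForPrompt is an illustrative helper, and the actual LLM call is left as a caller-supplied function since it is provider-specific:

```javascript
// Trim scraped markdown to a rough character budget before prompting an
// LLM, cutting at a paragraph boundary where possible so the model gets
// clean text. The 12k default is an arbitrary illustrative budget.
function clampForPrompt(markdown, maxChars = 12_000) {
  if (markdown.length <= maxChars) return markdown;
  const cut = markdown.slice(0, maxChars);
  const lastBreak = cut.lastIndexOf("\n\n");
  // Only back up to the paragraph break if it doesn't discard too much
  return lastBreak > maxChars / 2 ? cut.slice(0, lastBreak) : cut;
}

// `summarise` is whatever LLM client you use (OpenAI, Claude, a local
// model) — any function that takes a prompt string and returns text.
async function aiSummary(markdown, summarise) {
  const prompt = `Summarise this webpage content:\n\n${clampForPrompt(markdown)}`;
  return summarise(prompt);
}
```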
That would be enough for today. If you need anything else, feel free to email me.
Cheers
Shrey