
Yue Geng

Posted on • Originally published at gengyue.site

Building a Lightweight Web Scraping Toy with Bun’s Experimental `Bun.Webview`

Introduction

Bun v1.3.12 introduced a new experimental API called Bun.Webview. It enables simple browser automation and can partially replace tools like Playwright. That sounded exciting, so I gave it a try.

For macOS users, Bun.Webview can directly use the system’s native WebKit as the backend. On Windows and Linux, Chrome can be used as the backend via:

```ts
const view = new Bun.WebView({ backend: "chrome" });
```

According to the Bun documentation, Bun searches for Chrome in the following order:

  1. The path provided in backend: { type: "chrome", path: "..." }
  2. The BUN_CHROME_PATH environment variable
  3. $PATH (google-chrome-stable, google-chrome, chromium-browser, chromium, brave-browser, microsoft-edge, chrome)
  4. Common installation directories
  5. Playwright cache (~/Library/Caches/ms-playwright or ~/.cache/ms-playwright) for chrome-headless-shell
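Given that lookup order, there are two escape hatches when auto-discovery fails: pass an explicit path in the `backend` option, or export the environment variable before starting Bun (the binary path below is just an example):

```shell
# Point Bun at a specific Chrome/Chromium binary (path is illustrative)
export BUN_CHROME_PATH="/usr/bin/chromium"
```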

Integrating the Chrome Backend

In practice, I found that on Windows, Bun often failed to locate and launch Chrome no matter how I set BUN_CHROME_PATH, even when Chrome, Chromium, or Edge was installed.

I found a related issue in the Bun GitHub repository, which suggests this is still an early-stage limitation and will likely improve in future releases.

So I switched to another approach: manually launching Chrome with remote debugging enabled.

Chromium-based browsers support a remote debugging mode. For Chrome, it's available via chrome://inspect/#remote-debugging, and for Edge via edge://inspect/#remote-debugging.

In theory, enabling “Allow remote debugging” starts a server at 127.0.0.1:9222. On my laptop, however, the server started but every expected endpoint returned 404, which was odd.

Eventually, I resolved it by manually launching Edge from the command line:

```shell
"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe" --remote-debugging-port=9222
```

Now the debugging server works correctly.

Next, we need to connect Bun.Webview to this browser instance. We can fetch the WebSocket debugging URL like this:

```ts
import axios from "axios";

async function getBrowserDebuggingURL(): Promise<string> {
  try {
    const response = await axios.get("http://localhost:9222/json/version");
    return response.data.webSocketDebuggerUrl;
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error(`Failed to get browser debugging URL: ${message}`);
    throw new Error("Failed to get browser debugging URL");
  }
}

export { getBrowserDebuggingURL };
```
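Bun ships a global fetch, so the axios dependency is optional; a fetch-based equivalent of the helper above (same endpoint, same response field) might look like this sketch:

```ts
// Fetch the CDP WebSocket URL without a third-party HTTP client
async function getBrowserDebuggingURL(
  base = "http://localhost:9222",
): Promise<string> {
  const res = await fetch(`${base}/json/version`);
  if (!res.ok) {
    throw new Error(`CDP endpoint returned ${res.status}`);
  }
  const data = (await res.json()) as { webSocketDebuggerUrl?: string };
  if (!data.webSocketDebuggerUrl) {
    throw new Error("No webSocketDebuggerUrl in /json/version response");
  }
  return data.webSocketDebuggerUrl;
}
```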

Then pass it into Bun.Webview:

```ts
const view = new Bun.WebView({
  backend: {
    type: "chrome",
    url: await getBrowserDebuggingURL(),
  },
  headless: true,
});
```

At this point, Bun should successfully connect to the Chrome backend.


Web Scraping & Formatting

The Bun.Webview API is similar to Playwright's. We can extract page data like this:

```ts
const title = await view.evaluate(`
  document.title
  || document.querySelector('meta[property="og:title"]')?.content
  || document.querySelector('meta[name="twitter:title"]')?.content
  || document.querySelector('h1')?.textContent?.trim()
  || document.querySelector('h2')?.textContent?.trim()
  || ""
`);

const html = await view.evaluate("document.documentElement.outerHTML");
const text = await view.evaluate("document.documentElement.innerText");
```

For processing, I built a custom parser using cheerio to clean up the DOM, removing unnecessary tags like script and style, keeping only the body, and then converting HTML into Markdown using @mizchi/readability.

This helps reduce token usage (and yes—save money 😄):

```ts
import { extract, toMarkdown } from "@mizchi/readability";
import * as cheerio from "cheerio";

function normalizeHtml(html: string) {
  try {
    const $ = cheerio.load(html);
    return $("body").html() ?? "";
  } catch (error) {
    console.warn("Failed to normalize HTML:", error);
    return html;
  }
}

async function htmlParser(url: string, html: string): Promise<string> {
  try {
    const normalizedHtml = normalizeHtml(html);
    const extracted = extract(normalizedHtml, {
      charThreshold: 100,
    });

    if (!extracted?.root) {
      console.warn(`No root element found: ${url}`);
      return "";
    }

    const parsed = toMarkdown(extracted.root);

    if (typeof parsed !== "string" || parsed.trim().length === 0) {
      console.warn(`Markdown conversion empty: ${url}`);
      return "";
    }

    return parsed;
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error(`HTML parsing failed: ${message}`);
    return "";
  }
}

export { htmlParser };
```

Unfortunately, the Markdown conversion often fails. My guess is that the library is designed for readability-style article pages, and many sites don't follow that structure. In those cases, I fall back to innerText.
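The fallback itself can be a one-liner; as a sketch (pickContent is a hypothetical helper name, not part of my actual code):

```ts
// Prefer the Markdown conversion; fall back to the page's innerText
// when the conversion produced nothing usable
function pickContent(markdown: string, innerText: string): string {
  return markdown.trim().length > 0 ? markdown : innerText.trim();
}
```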

A quick test via curl:

```shell
curl -X POST http://localhost:9233/read \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-magic-access-token" \
  -d '{"url":"https://www.gengyue.site"}'
```

Example output:

```markdown
---
title: gengyue
url: https://www.gengyue.site
---

# Hi 👋!

...
```

User-Agent & Plugin System

Time to test real-world scraping: Zhihu, Xiaohongshu, and WeChat.

Interestingly, Zhihu and Xiaohongshu worked fine, but WeChat triggered anti-scraping protection.

So I tried a trick: spoofing the User-Agent.

Here are some presets:

```ts
export const UA_PRESETS = {
  iPhone_WebView: "...MicroMessenger/8.0.49",
  iPhone_Safari: "...Safari/604.1",
  Android_WebView: "...MicroMessenger/8.0.49",
  Android_Chrome: "...",
  Desktop_Chrome: "...",
  Desktop_Safari: "...",
  Baidu_Spider: "...",
  Googlebot: "...",
} as const;
```

Surprisingly, using an iPhone WebView UA bypassed WeChat restrictions.

We define a plugin:

```ts
const wechatPlugin = {
  name: "wechat",
  match(url: string): boolean {
    const hostname = new URL(url).hostname;
    // Exact match: endsWith("mp.weixin.qq.com") would also accept
    // unrelated hosts like "evil-mp.weixin.qq.com"
    return hostname === "mp.weixin.qq.com";
  },
  getUserAgent() {
    return UA_PRESETS.iPhone_WebView;
  },
};
```
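With more than one plugin, choosing a User-Agent becomes a first-match lookup. The `Plugin` interface and `resolveUserAgent` helper below are my own illustrative names, not part of the article's code:

```ts
interface Plugin {
  name: string;
  match(url: string): boolean;
  getUserAgent(): string;
}

// Return the UA of the first plugin whose match() accepts the URL,
// or undefined to keep the browser's default UA
function resolveUserAgent(url: string, plugins: Plugin[]): string | undefined {
  return plugins.find((plugin) => plugin.match(url))?.getUserAgent();
}
```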

Then apply it:

```ts
await view.cdp("Network.setUserAgentOverride", {
  userAgent: matchedUA,
});
```

Deployment

I originally built this for integration with a QQ bot, so I wrapped it into a simple HTTP backend using Bun + Hono.
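To illustrate the shape of that endpoint, here is a sketch of a /read-style handler using web-standard Request/Response types rather than Hono's actual API; the bearer-token check mirrors the curl calls in this post, and the scraping step itself is elided:

```ts
// Minimal sketch of a /read handler (the real service uses Hono)
async function handleRead(req: Request, token: string): Promise<Response> {
  if (req.headers.get("Authorization") !== `Bearer ${token}`) {
    return new Response("Unauthorized", { status: 401 });
  }
  const { url } = (await req.json()) as { url: string };
  if (!url) {
    return new Response("Missing url", { status: 400 });
  }
  // ...run the Bun.Webview scrape + htmlParser pipeline here...
  return new Response(JSON.stringify({ url }), {
    headers: { "Content-Type": "application/json" },
  });
}
```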

Deployment is straightforward: clone the repo, install dependencies, and run with pm2.
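Concretely, the steps might look like this (the entry file name and pm2 invocation are assumptions, not taken from the repo):

```shell
# Sketch of the deployment steps described above
git clone https://github.com/gengyue2468/fig
cd fig
bun install
# pm2 can run a non-Node script via --interpreter; index.ts is an assumed entry file
pm2 start index.ts --interpreter "$(command -v bun)" --name fig
```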

On Ubuntu, install Chromium:

```shell
sudo apt update
sudo apt install -y ca-certificates fonts-liberation fonts-noto-cjk
sudo apt install -y chromium-browser
```

To make Chromium behave more like a real user's browser, I also installed xvfb so it can run non-headless on a server without a display.

A user-level systemd service:

```ini
[Unit]
Description=Chromium Browser
After=network.target

[Service]
ExecStart=/usr/bin/xvfb-run --auto-servernum --server-args="-screen 0 1920x1080x24" \
  /usr/bin/chromium-browser \
  --no-sandbox \
  --remote-debugging-port=9222 \
  --user-data-dir=/tmp/chrome-debug-profile \
  https://www.google.com
Restart=always
RestartSec=10

[Install]
WantedBy=default.target
```

Enable it:

```shell
systemctl --user daemon-reload
systemctl --user enable chromium.service
systemctl --user start chromium.service
```

Test again:

```shell
curl -X POST http://localhost:9233/read \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-magic-access-token" \
  -d '{"url":"https://www.gengyue.site"}'
```

Everything works 🎉


Closing Thoughts

This is still a rough experimental setup. It’s not production-grade, but it works surprisingly well as a lightweight scraping backend and can already power bots or automation workflows.

Source code: https://github.com/gengyue2468/fig

The original article was published by me in Chinese (简体中文):
https://www.gengyue.site/blog/build-fig-via-bun-webview/
