
Yue Geng

Posted on • Originally published at gengyue.site

Building a Lightweight Web Scraping Toy with Bun’s Experimental `Bun.Webview`

Introduction

Bun v1.3.12 introduced a new experimental API called Bun.Webview. It enables simple browser automation and can partially replace tools like Playwright. That sounded exciting, so I gave it a try.

For macOS users, Bun.Webview can directly use the system’s native WebKit as the backend. On Windows and Linux, Chrome can be used as the backend via:

```ts
const view = new Bun.WebView({ backend: "chrome" });
```

According to the Bun documentation, Bun searches for Chrome in the following order:

  1. The path provided in backend: { type: "chrome", path: "..." }
  2. The BUN_CHROME_PATH environment variable
  3. $PATH (google-chrome-stable, google-chrome, chromium-browser, chromium, brave-browser, microsoft-edge, chrome)
  4. Common installation directories
  5. Playwright cache (~/Library/Caches/ms-playwright or ~/.cache/ms-playwright) for chrome-headless-shell
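Given that lookup order, there are two escape hatches when auto-discovery fails: pass an explicit path in the `backend` option, or export the environment variable before starting Bun (the binary path below is just an example):

```shell
# Point Bun at a specific Chrome/Chromium binary (path is illustrative)
export BUN_CHROME_PATH="/usr/bin/chromium"
```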

Integrating the Chrome Backend

In practice, I found that on Windows, Bun often failed to locate and launch Chrome no matter how I set BUN_CHROME_PATH, even when Chrome, Chromium, or Edge was installed.

I found a related issue in the Bun GitHub repository, which suggests this is still an early-stage limitation and will likely improve in future releases.

So I switched to another approach: manually launching Chrome with remote debugging enabled.

Chromium-based browsers support a remote debugging mode. For Chrome, it's available via chrome://inspect/#remote-debugging, and for Edge via edge://inspect/#remote-debugging.

In theory, enabling “Allow remote debugging” starts a server at 127.0.0.1:9222. On my laptop, however, the server started but every expected endpoint returned 404, which was odd.

Eventually, I resolved it by manually launching Edge from the command line:

```shell
"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe" --remote-debugging-port=9222
```

Now the debugging server works correctly.

Next, we need to connect Bun.Webview to this browser instance. We can fetch the WebSocket debugging URL like this:

```ts
import axios from "axios";

async function getBrowserDebuggingURL(): Promise<string> {
  try {
    const response = await axios.get("http://localhost:9222/json/version");
    return response.data.webSocketDebuggerUrl;
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error(`Failed to get browser debugging URL: ${message}`);
    throw new Error("Failed to get browser debugging URL");
  }
}

export { getBrowserDebuggingURL };
```
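Bun ships a global fetch, so the axios dependency is optional; a fetch-based equivalent of the helper above (same endpoint, same response field) might look like this sketch:

```ts
// Fetch the CDP WebSocket URL without a third-party HTTP client
async function getBrowserDebuggingURL(
  base = "http://localhost:9222",
): Promise<string> {
  const res = await fetch(`${base}/json/version`);
  if (!res.ok) {
    throw new Error(`CDP endpoint returned ${res.status}`);
  }
  const data = (await res.json()) as { webSocketDebuggerUrl?: string };
  if (!data.webSocketDebuggerUrl) {
    throw new Error("No webSocketDebuggerUrl in /json/version response");
  }
  return data.webSocketDebuggerUrl;
}
```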

Then pass it into Bun.Webview:

```ts
const view = new Bun.WebView({
  backend: {
    type: "chrome",
    url: await getBrowserDebuggingURL(),
  },
  headless: true,
});
```

At this point, Bun should successfully connect to the Chrome backend.


Web Scraping & Formatting

The Bun.Webview API is similar to Playwright's. We can extract page data like this:

```ts
const title = await view.evaluate(`
  document.title
  || document.querySelector('meta[property="og:title"]')?.content
  || document.querySelector('meta[name="twitter:title"]')?.content
  || document.querySelector('h1')?.textContent?.trim()
  || document.querySelector('h2')?.textContent?.trim()
  || ""
`);

const html = await view.evaluate("document.documentElement.outerHTML");
const text = await view.evaluate("document.documentElement.innerText");
```

For processing, I built a custom parser using cheerio to clean up the DOM, removing unnecessary tags like script and style, keeping only the body, and then converting HTML into Markdown using @mizchi/readability.

This helps reduce token usage (and yes—save money 😄):

```ts
import { extract, toMarkdown } from "@mizchi/readability";
import * as cheerio from "cheerio";

function normalizeHtml(html: string) {
  try {
    const $ = cheerio.load(html);
    return $("body").html() ?? "";
  } catch (error) {
    console.warn("Failed to normalize HTML:", error);
    return html;
  }
}

async function htmlParser(url: string, html: string): Promise<string> {
  try {
    const normalizedHtml = normalizeHtml(html);
    const extracted = extract(normalizedHtml, {
      charThreshold: 100,
    });

    if (!extracted?.root) {
      console.warn(`No root element found: ${url}`);
      return "";
    }

    const parsed = toMarkdown(extracted.root);

    if (typeof parsed !== "string" || parsed.trim().length === 0) {
      console.warn(`Markdown conversion empty: ${url}`);
      return "";
    }

    return parsed;
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error(`HTML parsing failed: ${message}`);
    return "";
  }
}

export { htmlParser };
```

Unfortunately, the Markdown conversion often fails. My guess is that the library is designed for readability-style article pages, and many sites don't follow that structure. In those cases, I fall back to innerText.
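The fallback itself can be a one-liner; as a sketch (pickContent is a hypothetical helper name, not part of my actual code):

```ts
// Prefer the Markdown conversion; fall back to the page's innerText
// when the conversion produced nothing usable
function pickContent(markdown: string, innerText: string): string {
  return markdown.trim().length > 0 ? markdown : innerText.trim();
}
```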

A quick test via curl:

```shell
curl -X POST http://localhost:9233/read \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-magic-access-token" \
  -d '{"url":"https://www.gengyue.site"}'
```

Example output:

```markdown
---
title: gengyue
url: https://www.gengyue.site
---

# Hi 👋!

...
```

User-Agent & Plugin System

Time to test real-world scraping: Zhihu, Xiaohongshu, and WeChat.

Interestingly, Zhihu and Xiaohongshu worked fine, but WeChat triggered anti-scraping protection.

So I tried a trick: spoofing the User-Agent.

Here are some presets:

```ts
export const UA_PRESETS = {
  iPhone_WebView: "...MicroMessenger/8.0.49",
  iPhone_Safari: "...Safari/604.1",
  Android_WebView: "...MicroMessenger/8.0.49",
  Android_Chrome: "...",
  Desktop_Chrome: "...",
  Desktop_Safari: "...",
  Baidu_Spider: "...",
  Googlebot: "...",
} as const;
```

Surprisingly, using an iPhone WebView UA bypassed WeChat restrictions.

We define a plugin:

```ts
const wechatPlugin = {
  name: "wechat",
  match(url: string): boolean {
    const hostname = new URL(url).hostname;
    // Exact match: endsWith("mp.weixin.qq.com") would also accept
    // unrelated hosts like "evil-mp.weixin.qq.com"
    return hostname === "mp.weixin.qq.com";
  },
  getUserAgent() {
    return UA_PRESETS.iPhone_WebView;
  },
};
```
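With more than one plugin, choosing a User-Agent becomes a first-match lookup. The `Plugin` interface and `resolveUserAgent` helper below are my own illustrative names, not part of the article's code:

```ts
interface Plugin {
  name: string;
  match(url: string): boolean;
  getUserAgent(): string;
}

// Return the UA of the first plugin whose match() accepts the URL,
// or undefined to keep the browser's default UA
function resolveUserAgent(url: string, plugins: Plugin[]): string | undefined {
  return plugins.find((plugin) => plugin.match(url))?.getUserAgent();
}
```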

Then apply it:

```ts
await view.cdp("Network.setUserAgentOverride", {
  userAgent: matchedUA,
});
```

Deployment

I originally built this for integration with a QQ bot, so I wrapped it into a simple HTTP backend using Bun + Hono.
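To illustrate the shape of that endpoint, here is a sketch of a /read-style handler using web-standard Request/Response types rather than Hono's actual API; the bearer-token check mirrors the curl calls in this post, and the scraping step itself is elided:

```ts
// Minimal sketch of a /read handler (the real service uses Hono)
async function handleRead(req: Request, token: string): Promise<Response> {
  if (req.headers.get("Authorization") !== `Bearer ${token}`) {
    return new Response("Unauthorized", { status: 401 });
  }
  const { url } = (await req.json()) as { url: string };
  if (!url) {
    return new Response("Missing url", { status: 400 });
  }
  // ...run the Bun.Webview scrape + htmlParser pipeline here...
  return new Response(JSON.stringify({ url }), {
    headers: { "Content-Type": "application/json" },
  });
}
```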

Deployment is straightforward: clone the repo, install dependencies, and run with pm2.
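Concretely, the steps might look like this (the entry file name and pm2 invocation are assumptions, not taken from the repo):

```shell
# Sketch of the deployment steps described above
git clone https://github.com/gengyue2468/fig
cd fig
bun install
# pm2 can run a non-Node script via --interpreter; index.ts is an assumed entry file
pm2 start index.ts --interpreter "$(command -v bun)" --name fig
```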

On Ubuntu, install Chromium:

```shell
sudo apt update
sudo apt install -y ca-certificates fonts-liberation fonts-noto-cjk
sudo apt install -y chromium-browser
```

To make Chromium behave more like a real user's browser, I also installed xvfb so it can run non-headless on a server without a display.

A user-level systemd service:

```ini
[Unit]
Description=Chromium Browser
After=network.target

[Service]
ExecStart=/usr/bin/xvfb-run --auto-servernum --server-args="-screen 0 1920x1080x24" \
  /usr/bin/chromium-browser \
  --no-sandbox \
  --remote-debugging-port=9222 \
  --user-data-dir=/tmp/chrome-debug-profile \
  https://www.google.com
Restart=always
RestartSec=10

[Install]
WantedBy=default.target
```

Enable it:

```shell
systemctl --user daemon-reload
systemctl --user enable chromium.service
systemctl --user start chromium.service
```

Test again:

```shell
curl -X POST http://localhost:9233/read \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-magic-access-token" \
  -d '{"url":"https://www.gengyue.site"}'
```

Everything works 🎉


Closing Thoughts

This is still a rough experimental setup. It’s not production-grade, but it works surprisingly well as a lightweight scraping backend and can already power bots or automation workflows.

Source code: https://github.com/gengyue2468/fig

The original article was published by me in Chinese (简体中文):
https://www.gengyue.site/blog/build-fig-via-bun-webview/
