DEV Community

Boris Barac

LinkLoom - AIWebReader

LinkLoom

A web scraping and content extraction toolkit for TypeScript/Bun.

Pass a URL, get clean markdown. That's the core. But LinkLoom also handles the cases that break simple scrapers: JavaScript-heavy pages rendered through a stealth browser, PDFs parsed into structured text, content pulled from nested iframes, HTML tables converted to markdown tables, links extracted and classified. It exposes a library API, a CLI, and an MCP server, so you can use it from code, from the terminal, or from an AI client like Claude Desktop or Cursor.

The full list: URL-to-markdown conversion, HTML-to-markdown via Readability + Turndown, PDF-to-markdown via pdf.js, headless browser rendering through Camoufox (stealth Firefox on Playwright), iframe extraction with configurable wait strategies, link extraction and classification, table scraping, text embeddings via OpenAI or Gemini, a CLI for every feature, and an MCP server for AI tool-use workflows.

Built with Bun, Camoufox, JSDOM, Readability, Turndown, and pdf.js-extract. Optional embedding support through LangChain.

import { convertLinkToMarkdown } from "linkloom";

const markdown = await convertLinkToMarkdown("https://example.com");

That's it. One import, one call. The function auto-detects whether the URL points to an HTML page or a PDF and routes it to the right converter. You get back a string of clean markdown — no boilerplate, no configuration objects, no setup ceremony.
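To make the auto-detection idea concrete, here is a minimal sketch of how a URL could be routed to an HTML or PDF converter. It is not LinkLoom's actual implementation: the detectSourceKind helper and its rules (trust the Content-Type header first, fall back to the file extension) are assumptions for illustration only.

```typescript
type SourceKind = "pdf" | "html";

// Hypothetical helper, not part of LinkLoom: pick a converter from the
// Content-Type header, falling back to the URL path when the header is unhelpful.
function detectSourceKind(url: string, contentType: string | null): SourceKind {
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const [rawMime = ""] = (contentType ?? "").split(";");
  const mime = rawMime.trim().toLowerCase();

  if (mime === "application/pdf") return "pdf";
  if (mime === "text/html" || mime === "application/xhtml+xml") return "html";

  // No usable header: look at the extension of the URL path (query string ignored).
  const path = new URL(url).pathname.toLowerCase();
  return path.endsWith(".pdf") ? "pdf" : "html";
}
```

Checking the header first matters because many PDFs are served from URLs without a .pdf extension, while the extension fallback covers servers that send a generic content type.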

The CLI equivalent:

bunx @boris.barac/linkloom scrape https://example.com

Same result, different interface. Pipe it, redirect it, or pass -o output.md to write to a file.

But plenty of pages don't hand you their content on the first request. They render everything with JavaScript — SPAs, dashboards, dynamically loaded articles. A simple fetch returns an empty shell. LinkLoom handles this through headless browser rendering via Camoufox, a stealth Firefox build on Playwright that avoids bot detection.

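import { renderers } from "linkloom";

const url = "https://example.com"; // page to render

// Launch the browser once, render the page, then release the browser.
const browser = await renderers.puppeterRendered.initialize();
const result = await renderers.puppeterRendered.renderPage(browser, url, {
  timeout: 15000, // overall page timeout
  waitUntil: "networkidle", // wait for the network to settle
  viewport: { width: 1920, height: 1080 },
  frames: { enabled: true, timeout: 5000 }, // also extract iframes, with their own timeout
});
await browser.close();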

The renderPage function loads the URL in a real browser, waits for the network to settle (or for a specific event), and returns the rendered HTML. The frames option tells it to also extract content from nested iframes — with its own timeout, because iframes load on their own schedule and you don't want one slow frame to block everything.
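The reason for a separate frame timeout can be sketched in a few lines. This is an illustration of the general pattern, not LinkLoom's code: withTimeout and collectFrames are hypothetical helpers, and each "frame" is modeled as a function returning a promise of its HTML.

```typescript
// Hypothetical helper: resolve with `fallback` if `task` takes longer than `ms`.
async function withTimeout<T>(task: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    // Whichever settles first wins: the real task or the timeout fallback.
    return await Promise.race([task, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Load all frames in parallel; a frame that misses its deadline is skipped
// instead of delaying the frames that did load.
async function collectFrames(
  loaders: Array<() => Promise<string>>,
  frameTimeoutMs: number,
): Promise<string[]> {
  const results = await Promise.all(
    loaders.map((load) => withTimeout(load(), frameTimeoutMs, "")),
  );
  return results.filter((html) => html.length > 0);
}
```

Because every frame races against its own timer, the total wait is bounded by the frame timeout rather than by the slowest frame.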

The CLI version:

bunx @boris.barac/linkloom render https://example.com --wait-until networkidle --timeout 15000

Add --selector "table.stats" to extract only a specific element instead of the full page. Useful when you know exactly what you're after.

Then there are PDFs. Research papers, technical reports, product documentation — a surprising amount of the web's useful content lives in PDFs, not HTML pages. The same convertLinkToMarkdown call handles both, but you can also convert PDFs directly:

import { pdfConverter } from "linkloom";
import { readFile } from "node:fs/promises";

const buffer = await readFile("document.pdf");
const markdown = await pdfConverter.convertPdfToMarkdown(buffer);
const text = await pdfConverter.convertPdfToText(buffer);

Two output modes: convertPdfToMarkdown preserves structure (headings, lists, formatting), while convertPdfToText strips everything down to plain text. Pick whichever fits your pipeline.

The CLI:

bunx @boris.barac/linkloom pdf document.pdf -o output.md

Under the hood it uses pdf.js-extract to parse the binary, so there's no external dependency on system tools like pdftotext. It works out of the box.

Content conversion is half the job. The other half is pulling structured data out of pages — links, tables, the things that aren't prose.

Link extraction finds and classifies URLs from plain text or HTML. Feed it a string and it returns every link, tagged as a PDF or a regular page:

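import { linkExtraction } from "linkloom";

// Sample HTML input for illustration.
const htmlContent = '<a href="https://example.com/report.pdf">Report</a>';

const links = linkExtraction.extractLinks("check https://example.com/doc.pdf");
const pdfLinks = await linkExtraction.extractDownloadLinksFromHtml(htmlContent);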

extractLinks works on raw text — it finds URLs and classifies them. extractDownloadLinksFromHtml parses an HTML document and pulls out links that point to downloadable files (PDFs, mostly). Useful when you're crawling a page and want to know which links lead to documents worth converting.
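For a feel of what "find URLs and classify them" involves, here is a small self-contained sketch. It is an assumption-laden illustration, not the library's implementation: the extractAndClassify name, the regular expression, and the ClassifiedLink shape are all hypothetical.

```typescript
type ClassifiedLink = { url: string; kind: "pdf" | "page" };

// Hypothetical helper: find http(s) URLs in free text and tag each one
// as a PDF or a regular page based on its path extension.
function extractAndClassify(text: string): ClassifiedLink[] {
  const matches = text.match(/https?:\/\/[^\s<>"')]+/g) ?? [];
  return matches.map((raw): ClassifiedLink => {
    // Drop sentence punctuation that got glued to the end of the URL.
    const url = raw.replace(/[.,;:!?]+$/, "");
    let isPdf = false;
    try {
      isPdf = new URL(url).pathname.toLowerCase().endsWith(".pdf");
    } catch {
      // Malformed URL: leave it classified as a regular page.
    }
    return { url, kind: isPdf ? "pdf" : "page" };
  });
}

const found = extractAndClassify(
  "check https://example.com/doc.pdf, then https://example.com/about.",
);
```

Classifying by path rather than by the raw string keeps query strings such as ?download=1 from hiding a .pdf extension.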

Table extraction renders a page in the headless browser and pulls out HTML tables as structured data:

import { tableExtraction, renderers } from "linkloom";

const url = "https://example.com/data"; // page containing the tables

const browser = await renderers.puppeterRendered.initialize();
const data = await tableExtraction.extractTableData(browser, url, "table"); // third argument: CSS selector
const md = tableExtraction.tableDataToMarkdownTable(data);
await browser.close();

The third argument is a CSS selector: pass "table" for all tables, or "table.stats" for a specific one. tableDataToMarkdownTable then turns the extracted data into a markdown table string, ready to drop into a document.
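The conversion from table data to a markdown table is simple enough to sketch. The version below assumes the data is a plain array of rows with the header first (string[][]); the source does not document the real shape returned by extractTableData, so treat toMarkdownTable as a hypothetical stand-in.

```typescript
// Hypothetical stand-in: render rows of cells as a markdown table.
// Assumes the first row is the header row.
function toMarkdownTable(rows: string[][]): string {
  if (rows.length === 0) return "";
  const header = rows[0] ?? [];
  const body = rows.slice(1);

  // Pipes would break the table layout, and newlines would break the row.
  const escapeCell = (cell: string): string =>
    cell.replace(/\|/g, "\\|").replace(/\s+/g, " ").trim();
  const renderRow = (cells: string[]): string =>
    `| ${cells.map(escapeCell).join(" | ")} |`;

  const divider = `| ${header.map(() => "---").join(" | ")} |`;
  return [renderRow(header), divider, ...body.map(renderRow)].join("\n");
}

const table = toMarkdownTable([
  ["Name", "Score"],
  ["Ada", "10"],
  ["Bob", "7"],
]);
```

Escaping the pipe character is the one detail that is easy to miss: a cell such as "a|b" would otherwise be read as two columns.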

The CLI shortcuts:

bunx @boris.barac/linkloom links https://example.com
bunx @boris.barac/linkloom tables https://example.com/data --selector "table.stats"

All of this is also available as an MCP server. If you use Claude Desktop, Cursor, or any MCP-compatible client, you can expose LinkLoom's tools without writing code — the AI calls them directly.

Six tools: scrape, html_to_markdown, pdf_to_markdown, render_page, extract_links, extract_tables. Same capabilities as the library and CLI, but surfaced as tool calls an AI agent can use autonomously.

Configuration is a few lines of JSON. For Claude Desktop on macOS, edit ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "linkloom": {
      "command": "bun",
      "args": ["x", "@boris.barac/linkloom", "mcp"]
    }
  }
}

For Cursor, add the same block to .cursor/mcp.json in your project or ~/.cursor/mcp.json globally. For any other MCP client, point it at bun x @boris.barac/linkloom mcp.

The server communicates over stdio. It reads JSON-RPC from stdin and writes responses to stdout. You don't run it directly; MCP clients spawn it as a child process. If you want to test it interactively, there's the MCP Inspector:

bunx @modelcontextprotocol/inspector bunx @boris.barac/linkloom mcp

That opens a web UI where you can browse the available tools, call them with custom parameters, and inspect the JSON-RPC messages going back and forth.
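To show what travels over stdio, here is a sketch of the JSON-RPC 2.0 message an MCP client would write to the server's stdin to invoke a tool. The tools/call method and newline-delimited framing come from the MCP specification; the scrape tool name is from the list above, while the url argument name is an assumption, since the tool's input schema is not shown here.

```typescript
type JsonRpcRequest = {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
};

// Build an MCP "tools/call" request for a named tool.
function toolCall(id: number, name: string, args: Record<string, unknown>): JsonRpcRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// The stdio transport sends one JSON message per line, so the serialized
// message must not contain raw newlines; JSON.stringify without indentation
// guarantees that.
function frame(message: JsonRpcRequest): string {
  return JSON.stringify(message) + "\n";
}

// "url" is an assumed argument name for the scrape tool.
const requestLine = frame(toolCall(1, "scrape", { url: "https://example.com" }));
```

The MCP Inspector displays messages of this shape, which makes it a convenient way to check the real argument names each tool expects.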

Get started

bun add @boris.barac/linkloom

Or skip the install and use it directly:

bunx @boris.barac/linkloom scrape https://example.com

No API keys needed for the core scraping pipeline. Only the optional text embedding feature requires an OpenAI or Gemini key.
