Every developer has been there: you find an article you want to read, but the page is drowning in ads, sticky navbars, newsletter popups, cookie banners, and auto-playing videos. You just want the text. Browser reader modes help, but they are trapped inside the browser. What if you could extract clean, readable content from any URL directly in your terminal, pipe it to other tools, or batch-process hundreds of pages at once?
That is exactly what I built with websnap-reader -- a Node.js CLI tool that turns any webpage into clean Markdown. It uses Chrome DevTools Protocol (CDP) for JavaScript-heavy sites, a readability algorithm to strip away clutter, and optional AI-powered summaries. In this article, I will walk through the architecture, the key problems I solved, and the actual code that makes it work.
The Problem: Web Pages Are Cluttered
The average web page in 2026 is roughly 20% content and 80% everything else. Navigation bars, sidebars, related article widgets, social share buttons, comment sections, ad slots, tracking scripts -- all of it wraps around the one thing you actually came for: the article text.
If you are building any kind of content pipeline -- research tooling, an RSS-to-Markdown converter, a personal knowledge base, or just want to read an article in your terminal -- you need a way to extract the signal from the noise.
Existing tools each solve part of the problem. readability-cli gives you extracted HTML but not Markdown. percollate produces PDFs but is not pipe-friendly. trafilatura handles extraction well but lacks JavaScript rendering. I wanted one tool that could handle the full pipeline: fetch (including JS-rendered pages), extract, convert to Markdown, and optionally summarize.
Architecture Overview
The tool is structured as four composable modules:
- Fetcher (fetcher.ts) -- Retrieves raw HTML via Chrome CDP or plain HTTP
- Parser (parser.ts) -- Strips noise and extracts the main content
- Formatter (formatter.ts) -- Converts clean HTML to Markdown or JSON
- Summarizer (summarizer.ts) -- Generates AI-powered summaries with multiple backend support
Each module has a single responsibility and can be used independently. The CLI orchestrator in index.ts wires them together based on the flags you pass.
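To make the flow concrete, here is a sketch of how the orchestrator might wire the modules together. All four functions below are stand-in stubs (the real modules do real work), but the call order mirrors the pipeline described above; names like formatOutput are illustrative, not the tool's actual API.

```typescript
// Stand-in stubs for the four modules; only the pipeline shape matters here.
type Article = { title: string; markdown: string; summary?: string };

const fetchPage = async (_url: string): Promise<string> =>
  "<article><h1>Hi</h1><p>Body text.</p></article>"; // fetcher.ts stub
const parseContent = (_html: string, _url: string): Article =>
  ({ title: "Hi", markdown: "# Hi\n\nBody text." }); // parser.ts stub
const summarize = async (a: Article): Promise<string> =>
  a.markdown.split("\n")[0]; // summarizer.ts stub
const formatOutput = (a: Article, json: boolean): string =>
  json ? JSON.stringify(a, null, 2) : a.markdown; // formatter.ts stub

// The orchestrator: fetch, parse, optionally summarize, then format.
async function snap(
  url: string,
  opts: { summary?: boolean; json?: boolean } = {}
): Promise<string> {
  const html = await fetchPage(url);
  const article = parseContent(html, url);
  if (opts.summary) article.summary = await summarize(article);
  return formatOutput(article, !!opts.json);
}
```

Because each stage takes the previous stage's output as plain data, any module can be swapped or used on its own.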
Fetching Pages with Chrome DevTools Protocol
The biggest challenge with web scraping in 2026 is that many sites are single-page applications or rely heavily on client-side JavaScript rendering. A plain HTTP fetch will give you an empty <div id="root"></div> and nothing else.
The solution is Chrome DevTools Protocol. Instead of bundling a headless browser (which adds hundreds of megabytes to your install), websnap-reader connects to your already-running Chrome instance over its debugging port. This means you get full JavaScript rendering, existing cookies (so login-required pages just work), and zero extra dependencies.
Here is the core CDP fetching logic:
async function fetchViaCDP(
url: string,
cdpEndpoint: string,
timeout: number,
userAgent?: string
): Promise<string> {
// Probe the browser first to confirm the debugging endpoint is reachable
const versionUrl = `${cdpEndpoint}/json/version`;
await fetchWithTimeout(versionUrl, 3000);
// Create a new target (tab)
const newTabUrl = `${cdpEndpoint}/json/new?${encodeURIComponent("about:blank")}`;
const newTabRes = await fetchWithTimeout(newTabUrl, 3000);
const tabInfo = await newTabRes.json();
const wsUrl = tabInfo.webSocketDebuggerUrl;
// Connect via WebSocket
const session = await connectCDP(wsUrl);
try {
await session.send("Page.enable");
await session.send("Network.enable");
// Navigate to the target URL
await session.send("Page.navigate", { url });
// Wait for the page load event
await waitForLoad(session, timeout);
// Extra delay for JS rendering to complete
await new Promise((r) => setTimeout(r, 1500));
// Extract the fully-rendered HTML
const result = await session.send("Runtime.evaluate", {
expression: "document.documentElement.outerHTML",
returnByValue: true,
});
return result?.result?.value ?? "";
} finally {
// Clean up: close the tab
const closeUrl = `${cdpEndpoint}/json/close/${tabInfo.id}`;
await fetchWithTimeout(closeUrl, 2000).catch(() => {});
session.close();
}
}
The key insight here is the Runtime.evaluate call. Instead of trying to intercept network responses or parse partial HTML, we wait for the page to fully render and then ask Chrome for the complete DOM as a string. This gives us the exact same HTML that a human user would see.
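The fetch routine above also calls a waitForLoad helper that is not shown. Here is one plausible implementation, assuming the session exposes an onEvent hook for CDP events (the real CDPSession may surface events differently): it waits for Chrome's Page.loadEventFired notification and races it against a timeout.

```typescript
type EventHandler = (method: string) => void;

// Resolve when CDP reports Page.loadEventFired, or reject on timeout.
// The `onEvent` hook is an assumption about the session's shape.
function waitForLoad(
  session: { onEvent: (h: EventHandler) => void },
  timeout: number
): Promise<void> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`page load timed out after ${timeout}ms`)),
      timeout
    );
    session.onEvent((method) => {
      if (method === "Page.loadEventFired") {
        clearTimeout(timer);
        resolve();
      }
    });
  });
}
```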
The WebSocket connection to CDP is built from scratch using Node.js built-in APIs -- no Puppeteer, no Playwright, no heavy dependencies:
async function connectCDP(wsUrl: string): Promise<CDPSession> {
const WebSocket = (globalThis as any).WebSocket || (await getWebSocket());
return new Promise((resolve, reject) => {
const ws = new WebSocket(wsUrl);
let msgId = 0;
const pending = new Map<number, { resolve: Function; reject: Function }>();
ws.onopen = () => {
const session: CDPSession = {
ws,
id: 0,
send(method: string, params: Record<string, any> = {}) {
return new Promise((res, rej) => {
const id = ++msgId;
pending.set(id, { resolve: res, reject: rej });
ws.send(JSON.stringify({ id, method, params }));
});
},
close() { ws.close(); },
};
resolve(session);
};
ws.onmessage = (event: any) => {
const msg = JSON.parse(event.data);
if (msg.id && pending.has(msg.id)) {
const handler = pending.get(msg.id)!;
pending.delete(msg.id);
msg.error ? handler.reject(new Error(msg.error.message)) : handler.resolve(msg.result);
}
};
// Without this, the promise would hang forever if the connection fails
ws.onerror = () => reject(new Error(`CDP WebSocket connection failed: ${wsUrl}`));
});
}
When Chrome is not running with remote debugging enabled, the tool gracefully falls back to a plain HTTP fetch with a realistic User-Agent header. It even handles TLS certificate issues by falling back to node:https with relaxed verification -- useful in corporate environments with proxy certificates.
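A minimal sketch of that plain-HTTP fallback might look like this. The User-Agent string and the injectable fetch parameter are illustrative choices, not the tool's exact code; injecting the fetch implementation just makes the logic testable without a network.

```typescript
// Illustrative desktop-Chrome User-Agent; the real tool's header may differ.
const FALLBACK_UA =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

// Plain HTTP fallback: realistic headers plus an AbortController timeout.
async function fetchPlain(
  url: string,
  timeout: number,
  userAgent: string = FALLBACK_UA,
  fetchImpl: typeof fetch = fetch
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeout);
  try {
    const res = await fetchImpl(url, {
      headers: { "User-Agent": userAgent, Accept: "text/html" },
      signal: controller.signal,
      redirect: "follow",
    });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
    return await res.text();
  } finally {
    clearTimeout(timer);
  }
}
```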
Converting HTML to Clean Markdown with a Readability Algorithm
Raw HTML, even after JavaScript rendering, is still full of noise. The parser module implements a readability algorithm in three phases:
Phase 1: Strip non-content elements. Scripts, styles, iframes, SVGs, and other non-textual elements are completely removed:
const REMOVE_TAGS = new Set([
"script", "style", "noscript", "iframe", "object",
"embed", "applet", "link", "meta", "svg", "canvas",
"video", "audio", "source", "track", "map", "area",
]);
Phase 2: Remove navigation and noise. Elements matching common noise patterns (ads, sidebars, comments, share buttons, cookie banners) are identified by their class names and IDs, then stripped out:
const NOISE_PATTERNS = [
/\bad[s]?\b/i, /\bbanner\b/i, /\bcomment/i, /\bfooter/i,
/\bnav\b/i, /\bsidebar/i, /\bsocial/i, /\bsponsor/i,
/\bcookie/i, /\bnewsletter/i, /\bsubscri/i, /\boverlay/i,
// ... more patterns
];
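Applying those patterns could look like the following regex-based sketch: check each element's class and id attributes against the noise list and drop the element if either matches. This simplified version only handles non-nested containers; the actual parser may traverse the markup differently.

```typescript
// Subset of the noise patterns from the article.
const NOISE_PATTERNS = [
  /\bad[s]?\b/i, /\bbanner\b/i, /\bcomment/i, /\bfooter/i,
  /\bnav\b/i, /\bsidebar/i, /\bsocial/i, /\bsponsor/i,
  /\bcookie/i, /\bnewsletter/i,
];

function isNoise(classAndId: string): boolean {
  return NOISE_PATTERNS.some((p) => p.test(classAndId));
}

// Drop div/aside/section elements whose class or id matches a noise pattern.
// Regex matching cannot handle nested same-name tags; this is a sketch only.
function stripNoise(html: string): string {
  return html.replace(
    /<(div|aside|section)\b([^>]*)>([\s\S]*?)<\/\1>/gi,
    (full, _tag, attrs) => {
      const cls = /class="([^"]*)"/i.exec(attrs)?.[1] ?? "";
      const id = /id="([^"]*)"/i.exec(attrs)?.[1] ?? "";
      return isNoise(`${cls} ${id}`) ? "" : full;
    }
  );
}
```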
Phase 3: Find the main content container. The algorithm looks for content in this priority order: <article> tags, <main> tags, elements with role="main", then elements whose class or ID matches content-indicative patterns like "article", "content", "entry", "post":
function extractMainContent(html: string): string {
// Try <article> tag first
const articleMatch = html.match(/<article[^>]*>([\s\S]*?)<\/article>/i);
if (articleMatch && articleMatch[1].length > 200) return articleMatch[1];
// Try <main> tag
const mainMatch = html.match(/<main[^>]*>([\s\S]*?)<\/main>/i);
if (mainMatch && mainMatch[1].length > 200) return mainMatch[1];
// Try content-indicative class/id patterns
for (const pattern of CONTENT_PATTERNS) {
const regex = new RegExp(
`<(div|section)[^>]*(?:class|id)="[^"]*${pattern.source}[^"]*"[^>]*>([\\s\\S]*?)<\\/\\1>`,
"gi"
);
// Find the largest matching container
let best = "";
let match;
while ((match = regex.exec(html)) !== null) {
if (match[2].length > best.length) best = match[2];
}
if (best.length > 200) return best;
}
// Fallback: entire <body>
const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
return bodyMatch ? bodyMatch[1] : html;
}
The 200-character minimum threshold prevents the algorithm from latching onto small decorative elements that happen to be wrapped in an <article> tag.
Once the clean HTML is extracted, it is converted to Markdown using node-html-markdown with tuned settings for fenced code blocks, consistent bullet markers, and controlled newline behavior:
const nhm = new NodeHtmlMarkdown({
preferNativeParser: false,
codeBlockStyle: "fenced",
bulletMarker: "-",
strongDelimiter: "**",
emDelimiter: "*",
maxConsecutiveNewlines: 2,
});
Metadata -- title, author, publication date, site name -- is extracted from Open Graph tags, Twitter cards, and standard meta elements, giving you structured data alongside the content.
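A simplified version of that metadata lookup is sketched below. The priority order (Open Graph, then Twitter card, then standard tags) is an assumption; the article only says all three sources are consulted.

```typescript
// Find a <meta> tag's content by its property or name, in either
// attribute order.
function extractMeta(html: string, property: string): string | undefined {
  const patterns = [
    new RegExp(
      `<meta[^>]+(?:property|name)="${property}"[^>]+content="([^"]*)"`, "i"),
    new RegExp(
      `<meta[^>]+content="([^"]*)"[^>]+(?:property|name)="${property}"`, "i"),
  ];
  for (const p of patterns) {
    const m = p.exec(html);
    if (m) return m[1];
  }
  return undefined;
}

// Assumed priority: Open Graph, then Twitter card, then standard tags.
function extractMetadata(html: string) {
  return {
    title:
      extractMeta(html, "og:title") ??
      extractMeta(html, "twitter:title") ??
      /<title[^>]*>([^<]*)<\/title>/i.exec(html)?.[1],
    author: extractMeta(html, "author") ?? extractMeta(html, "article:author"),
    siteName: extractMeta(html, "og:site_name"),
    published: extractMeta(html, "article:published_time"),
  };
}
```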
Adding AI-Powered Summaries
Sometimes you do not need the full article. You just need a quick summary to decide whether it is worth reading. The --summary flag generates a concise 3-sentence summary using whichever AI backend is available.
The summarizer follows a cascade pattern, trying backends in order until one succeeds:
export async function summarize(article: ParsedArticle): Promise<string> {
const prompt = formatSummaryPrompt(article);
// 1. Try OpenAI (if OPENAI_API_KEY is set)
if (config.openaiApiKey) {
try { return await summarizeOpenAI(prompt, config); }
catch (err) { /* fall through */ }
}
// 2. Try Anthropic (if ANTHROPIC_API_KEY is set)
if (config.anthropicApiKey) {
try { return await summarizeAnthropic(prompt, config); }
catch (err) { /* fall through */ }
}
// 3. Try Ollama (local, no API key needed)
try { return await summarizeOllama(prompt, config); }
catch (err) { /* fall through */ }
// 4. Fallback: extractive summary (no AI at all)
return extractiveSummary(article);
}
The extractive fallback is surprisingly effective. After filtering out very short and very long sentences, it scores the rest on position (early sentences matter more) and overlap with the title (topic relevance). It picks the top three and returns them in their original order:
function extractiveSummary(article: ParsedArticle): string {
const text = article.content; // plain text of the extracted article body
const sentences = text.split(/(?<=[.!?])\s+/)
.filter(s => { const w = s.split(/\s+/).length; return w >= 5 && w <= 50; });
const titleWords = new Set(
article.title.toLowerCase().split(/\s+/).filter(w => w.length > 3)
);
const scored = sentences.map((sentence, index) => {
let score = 0;
if (index === 0) score += 5; // First sentence bonus
else if (index === 1) score += 3;
const words = sentence.toLowerCase().split(/\s+/);
for (const w of words) {
if (titleWords.has(w)) score += 2; // Title overlap bonus
}
return { sentence, score, index };
});
scored.sort((a, b) => b.score - a.score);
const top3 = scored.slice(0, 3);
top3.sort((a, b) => a.index - b.index); // Restore original order
return top3.map(s => s.sentence).join(" ");
}
This means summaries work even with no API keys and no internet -- useful for offline research workflows.
Batch Processing Multiple URLs
Research often involves processing dozens or hundreds of URLs at once. The batch command reads a file of URLs (one per line) and processes them sequentially with a configurable delay:
# Create a file with URLs
cat > urls.txt << EOF
https://paulgraham.com/greatwork.html
https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/
https://martinfowler.com/bliki/MonolithFirst.html
EOF
# Batch process to individual markdown files
websnap batch urls.txt --outdir ./articles
# Batch process with AI summaries as JSON
websnap batch urls.txt --summary --json
The batch processor tracks success and failure counts, slugifies article titles for filenames, and writes progress to stderr so stdout remains clean for piping:
for (let i = 0; i < urls.length; i++) {
const url = urls[i];
process.stderr.write(`[${i + 1}/${urls.length}] ${url}\n`);
try {
const html = await fetchPage(url, options);
const article = parseContent(html, url);
if (options.outdir) {
const slug = slugify(article.title || `page-${i + 1}`);
const ext = options.json ? ".json" : ".md";
const outPath = path.join(options.outdir, slug + ext);
fs.writeFileSync(outPath, formatMarkdown(article, url), "utf-8");
}
successCount++;
} catch (err) {
failCount++;
}
// Configurable delay between requests
if (i < urls.length - 1 && delay > 0) {
await new Promise(r => setTimeout(r, delay));
}
}
Lines starting with # in the URL file are treated as comments, so you can annotate your URL lists.
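Parsing such a file is a one-liner worth showing; the helper name here is illustrative, but the behavior (skip blank lines and # comments, trim whitespace) follows what the article describes.

```typescript
// Split a URL file into entries, dropping blanks and '#' comment lines.
function parseUrlFile(contents: string): string[] {
  return contents
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"));
}
```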
Publishing as an npm CLI Tool
The tool is published as a global npm package. The #!/usr/bin/env node shebang at the top of index.ts and the bin field in package.json make it work as a standalone command after installation.
The output is designed to be Unix-friendly. Content goes to stdout, status messages go to stderr, and exit codes are meaningful. This means you can compose it with other tools:
# Pipe to glow for terminal rendering
websnap https://example.com | glow -
# Copy to clipboard on macOS
websnap https://example.com | pbcopy
# Pipe to an LLM for further analysis
websnap https://example.com | llm "What are the key arguments?"
# Extract just the title and word count with jq
websnap https://example.com --json | jq '{title, wordCount}'
Get Started
Install globally with npm:
npm install -g websnap-reader
Or try it without installing:
npx websnap-reader https://paulgraham.com/greatwork.html
For JavaScript-heavy sites, start Chrome with remote debugging:
google-chrome --remote-debugging-port=9222
Then websnap-reader will automatically detect and use it.
More tools from this series:
- websnap-reader -- Turn any URL into clean Markdown from your terminal
- gitpulse -- GitHub activity analytics and contribution insights
- depcheck-ai -- AI-powered dependency analysis for Node.js projects
- ghprofile-stats -- Generate detailed GitHub profile statistics