## TL;DR
I built web-meta-scraper — a lightweight TypeScript library that extracts metadata from web pages. 14 built-in plugins, 1 dependency, ~5KB. It also ships as an MCP server so AI assistants like Claude can use it as a tool.
## Why I built this
I needed to extract Open Graph, Twitter Cards, and JSON-LD from URLs for a link preview feature. Existing solutions felt like the wrong trade-off for what should be a simple task: metascraper pulls in 10+ dependencies (~50KB), and open-graph-scraper has no plugin system and little room for customization.
So I built one with a few goals:
- Minimal dependencies — only cheerio for HTML parsing, native `fetch()` for HTTP
- Composable — pick only the plugins you need
- Type-safe — full TypeScript definitions for everything
- Extensible — custom plugins are just functions
## Quick Start

```bash
npm install web-meta-scraper
```

```typescript
import { scrape } from 'web-meta-scraper';

const result = await scrape('https://github.com');
console.log(result.metadata);
// {
//   title: "GitHub",
//   description: "Build and ship software on a single...",
//   image: "https://github.githubassets.com/assets/...",
//   ...
// }
```
That's it. One function, everything extracted and merged.
## Plugin Architecture

The core idea: each metadata source is a separate plugin. Use all of them or just the ones you need.

```typescript
import { createScraper, openGraph, twitter, video, audio } from 'web-meta-scraper';

const scraper = createScraper({
  plugins: [openGraph, twitter, video, audio],
});
```
## 14 Built-in Plugins

| Plugin | What it extracts |
|---|---|
| `metaTags` | title, description, keywords, author, favicon, canonical URL |
| `openGraph` | og:title, og:image, og:type, og:site_name, etc. |
| `twitter` | twitter:card, twitter:image, twitter:creator, etc. |
| `jsonLd` | Schema.org structured data (Article, Product, FAQ, etc.) |
| `oembed` | oEmbed data with async endpoint fetching |
| `favicons` | All icon variants (apple-touch-icon, mask-icon, manifest) |
| `feeds` | RSS and Atom feed links |
| `robots` | Robots directives, indexability flags |
| `date` | Publication and modification dates |
| `logo` | Site logo from OG, Schema.org, JSON-LD |
| `lang` | Document language (BCP 47) |
| `video` | Video resources from OG, twitter:player, `<video>`, JSON-LD |
| `audio` | Audio resources from OG, `<audio>`, JSON-LD |
| `iframe` | Embeddable iframe HTML |
## Custom Plugins

A plugin is just a function:

```typescript
import type { Plugin } from 'web-meta-scraper';

const pricePlugin: Plugin = (ctx) => {
  const price = ctx.$('[itemprop="price"]').attr('content');
  return {
    name: 'price',
    data: { price },
  };
};
```
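One nice side effect of plugins being plain functions: you can exercise one in isolation with a hand-stubbed context. The `Ctx` and `Plugin` stand-in types below are inlined so this sketch runs on its own; the real types come from the library, and its `ctx` wraps cheerio.

```typescript
// Stand-in types so this sketch is self-contained (the real Plugin type
// comes from web-meta-scraper; its ctx wraps cheerio).
type Ctx = { $: (selector: string) => { attr: (name: string) => string | undefined } };
type Plugin = (ctx: Ctx) => { name: string; data: Record<string, unknown> };

const pricePlugin: Plugin = (ctx) => ({
  name: 'price',
  data: { price: ctx.$('[itemprop="price"]').attr('content') },
});

// Fake a page that contains <meta itemprop="price" content="19.99">.
const fakeCtx: Ctx = {
  $: () => ({ attr: () => '19.99' }),
};

console.log(pricePlugin(fakeCtx)); // { name: 'price', data: { price: '19.99' } }
```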
## Priority-Based Merging

When the same field exists in multiple sources, the highest-priority value wins automatically. You can also customize the rules:

```typescript
const scraper = createScraper({
  plugins: [metaTags, openGraph, twitter],
  rules: [
    {
      field: 'title',
      sources: [
        { plugin: 'twitter', key: 'title', priority: 3 }, // Twitter first
        { plugin: 'open-graph', key: 'title', priority: 2 },
        { plugin: 'meta-tags', key: 'title', priority: 1 },
      ],
    },
  ],
});
```
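Conceptually the merge is simple. This is not the library's internal code, just a sketch of the behavior those rules describe: collect candidate values per field, drop the undefined ones, and take the highest priority.

```typescript
// A sketch of priority-based merging, not web-meta-scraper's internals.
type Candidate = { plugin: string; value?: string; priority: number };

function resolveField(candidates: Candidate[]): string | undefined {
  return candidates
    .filter((c) => c.value !== undefined)
    .sort((a, b) => b.priority - a.priority)[0]?.value;
}

// With rules like the above, the Twitter title beats Open Graph and meta tags:
resolveField([
  { plugin: 'meta-tags', value: 'Example', priority: 1 },
  { plugin: 'open-graph', value: 'Example via OG', priority: 2 },
  { plugin: 'twitter', value: 'Example via Twitter', priority: 3 },
]); // 'Example via Twitter'
```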
## Beyond Scraping

### SEO Validation

```typescript
import { scrape, validateMetadata } from 'web-meta-scraper';

const result = await scrape('https://example.com');
const validation = validateMetadata(result);

console.log(validation.score); // 85/100
console.log(validation.issues);
// [{ field: "description", severity: "warning", message: "Description is too short" }]
```
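If you run validation in CI, one pattern is to fail the build on errors while tolerating warnings. A sketch, assuming issues follow the shape shown above (the severity values here are my assumption; adjust to what the library actually reports):

```typescript
// Issue mirrors the validateMetadata output shown above (severity values
// are an assumption for this sketch).
type Issue = { field: string; severity: 'error' | 'warning'; message: string };

function hasBlockingIssues(
  issues: Issue[],
  threshold: 'error' | 'warning' = 'error',
): boolean {
  return issues.some(
    (i) => i.severity === 'error' || (threshold === 'warning' && i.severity === 'warning'),
  );
}

hasBlockingIssues([
  { field: 'description', severity: 'warning', message: 'Description is too short' },
]); // false: warnings alone do not block by default
```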
### Content Extraction

Strips navigation, ads, sidebars — returns clean text:

```typescript
import { extractContent } from 'web-meta-scraper';

const content = await extractContent('https://example.com/article');

console.log(content.content);   // "Article body text..."
console.log(content.wordCount); // 1234
console.log(content.language);  // "en"
```
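`wordCount` makes small derived features trivial, e.g. an estimated reading time (the 200 words-per-minute figure here is my assumption, not something the library computes):

```typescript
// Estimated reading time from the extracted word count.
// 200 wpm is a rough average reading speed, floored at 1 minute.
function readingTimeMinutes(wordCount: number, wordsPerMinute = 200): number {
  return Math.max(1, Math.round(wordCount / wordsPerMinute));
}

readingTimeMinutes(1234); // 6
```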
### Batch Scraping

```typescript
import { batchScrape } from 'web-meta-scraper';

const results = await batchScrape(
  ['https://example.com', 'https://github.com', 'https://nodejs.org'],
  { concurrency: 3 },
);
```
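The `concurrency` option caps how many requests are in flight at once. Roughly the same pattern in plain TypeScript (a sketch of the idea, not `batchScrape`'s internals):

```typescript
// Run fn over items with at most `limit` promises in flight,
// preserving input order in the results.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const worker = async () => {
    while (next < items.length) {
      const i = next++; // claim the next index
      results[i] = await fn(items[i]);
    }
  };
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```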
## MCP Server — AI Assistants Can Use It

This is the part I'm most excited about.

web-meta-scraper-mcp exposes all scraping functionality as MCP (Model Context Protocol) tools, so AI assistants like Claude can directly scrape, validate, and analyze web metadata.

### Setup (one command)

Claude Code:

```bash
claude mcp add web-meta-scraper -- npx -y web-meta-scraper-mcp
```

Claude Desktop / Cursor:

```json
{
  "mcpServers": {
    "web-meta-scraper": {
      "command": "npx",
      "args": ["-y", "web-meta-scraper-mcp"]
    }
  }
}
```
## 7 Tools Available

| Tool | Description |
|---|---|
| `scrape_url` | Extract metadata from a URL |
| `scrape_html` | Extract metadata from raw HTML |
| `batch_scrape` | Scrape multiple URLs concurrently |
| `detect_feeds` | Detect RSS/Atom feeds |
| `check_robots` | Check robots directives |
| `validate_metadata` | SEO score report |
| `extract_content` | Extract main text content |
Once connected, you can ask Claude things like:
- "Scrape this URL and summarize the metadata"
- "Check if this page is SEO-friendly"
- "Extract the main content from this article"
- "Find all RSS feeds on this blog"
The MCP server handles everything — Claude just calls the tools and gets structured results back.
## Try It
Live Playground: https://radiant-malabi-26e1e6.netlify.app/playground
Enter any URL, toggle stealth mode, and see extracted metadata, SEO validation, and content extraction in action.
## Links
- GitHub: github.com/cmg8431/web-meta-scraper
- npm: web-meta-scraper
- MCP package: web-meta-scraper-mcp
- Docs: https://radiant-malabi-26e1e6.netlify.app
If you find it useful, a star on GitHub would mean a lot. Feedback and contributions are always welcome!