Mingi Choe

I built a plugin-based web metadata scraper with MCP server support

TL;DR

I built web-meta-scraper — a lightweight TypeScript library that extracts metadata from web pages. 14 built-in plugins, 1 dependency, ~5KB. It also ships as an MCP server so AI assistants like Claude can use it as a tool.


Why I built this

I needed to extract Open Graph, Twitter Cards, and JSON-LD from URLs for a link preview feature. Existing solutions didn't fit: metascraper pulls in 10+ dependencies (~50KB), and open-graph-scraper has no plugin system and little room for customization. Both felt wrong for what should be a simple task.

So I built one with a few goals:

  • Minimal dependencies — only cheerio for HTML parsing, native fetch() for HTTP
  • Composable — pick only the plugins you need
  • Type-safe — full TypeScript definitions for everything
  • Extensible — custom plugins are just functions

Quick Start

npm install web-meta-scraper
import { scrape } from 'web-meta-scraper';

const result = await scrape('https://github.com');

console.log(result.metadata);
// {
//   title: "GitHub",
//   description: "Build and ship software on a single...",
//   image: "https://github.githubassets.com/assets/...",
//   ...
// }

That's it. One function, everything extracted and merged.

Plugin Architecture

The core idea: each metadata source is a separate plugin. Use all of them or just the ones you need.

import { createScraper, openGraph, twitter, video, audio } from 'web-meta-scraper';

const scraper = createScraper({
  plugins: [openGraph, twitter, video, audio],
});

14 Built-in Plugins

| Plugin | What it extracts |
| --- | --- |
| metaTags | title, description, keywords, author, favicon, canonical URL |
| openGraph | og:title, og:image, og:type, og:site_name, etc. |
| twitter | twitter:card, twitter:image, twitter:creator, etc. |
| jsonLd | Schema.org structured data (Article, Product, FAQ, etc.) |
| oembed | oEmbed data with async endpoint fetching |
| favicons | All icon variants (apple-touch-icon, mask-icon, manifest) |
| feeds | RSS and Atom feed links |
| robots | Robots directives, indexability flags |
| date | Publication and modification dates |
| logo | Site logo from OG, Schema.org, JSON-LD |
| lang | Document language (BCP 47) |
| video | Video resources from OG, twitter:player, `<video>`, JSON-LD |
| audio | Audio resources from OG, `<audio>`, JSON-LD |
| iframe | Embeddable iframe HTML |

Custom Plugins

A plugin is just a function:

import type { Plugin } from 'web-meta-scraper';

const pricePlugin: Plugin = (ctx) => {
  const price = ctx.$('[itemprop="price"]').attr('content');
  return {
    name: 'price',
    data: { price },
  };
};
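Because a plugin is just a function, you can unit-test it with a stubbed context and no network at all. The types below are local stand-ins inferred from the example above, not the library's actual exports:

```typescript
// Minimal stand-ins for the library's types, inferred from the example above.
type PluginResult = { name: string; data: Record<string, unknown> };
type Ctx = { $: (selector: string) => { attr: (name: string) => string | undefined } };
type Plugin = (ctx: Ctx) => PluginResult;

const pricePlugin: Plugin = (ctx) => {
  const price = ctx.$('[itemprop="price"]').attr('content');
  return { name: 'price', data: { price } };
};

// Stub a cheerio-like context that returns a fixed price for the selector.
const fakeCtx: Ctx = {
  $: (selector) => ({
    attr: (name) =>
      selector === '[itemprop="price"]' && name === 'content' ? '19.99' : undefined,
  }),
};

const result = pricePlugin(fakeCtx);
console.log(result); // { name: 'price', data: { price: '19.99' } }
```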

Priority-Based Merging

When the same field exists in multiple sources, the highest-priority value wins automatically. And you can customize the rules:

const scraper = createScraper({
  plugins: [metaTags, openGraph, twitter],
  rules: [
    {
      field: 'title',
      sources: [
        { plugin: 'twitter', key: 'title', priority: 3 },   // Twitter first
        { plugin: 'open-graph', key: 'title', priority: 2 },
        { plugin: 'meta-tags', key: 'title', priority: 1 },
      ],
    },
  ],
});
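Conceptually, the merge step walks each field's sources from highest to lowest priority and takes the first value a plugin actually produced. A rough sketch of that logic (an illustration, not the library's internals):

```typescript
type Source = { plugin: string; key: string; priority: number };

// Pick the highest-priority value that a plugin actually produced.
function mergeField(
  sources: Source[],
  pluginData: Record<string, Record<string, string | undefined>>,
): string | undefined {
  const ranked = [...sources].sort((a, b) => b.priority - a.priority);
  for (const { plugin, key } of ranked) {
    const value = pluginData[plugin]?.[key];
    if (value !== undefined) return value;
  }
  return undefined;
}

// twitter has priority 3 but produced no title, so open-graph (priority 2) wins.
const title = mergeField(
  [
    { plugin: 'twitter', key: 'title', priority: 3 },
    { plugin: 'open-graph', key: 'title', priority: 2 },
    { plugin: 'meta-tags', key: 'title', priority: 1 },
  ],
  {
    'open-graph': { title: 'OG Title' },
    'meta-tags': { title: 'Meta Title' },
  },
);
console.log(title); // "OG Title"
```

This is also why a high-priority source never blocks a result: if it has no value for the field, the merge simply falls through to the next source.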

Beyond Scraping

SEO Validation

import { scrape, validateMetadata } from 'web-meta-scraper';

const result = await scrape('https://example.com');
const validation = validateMetadata(result);

console.log(validation.score);  // 85/100
console.log(validation.issues);
// [{ field: "description", severity: "warning", message: "Description is too short" }]
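Assuming the issue shape shown above (field, severity, message), it's straightforward to gate a CI check on hard errors while only reporting warnings. A hypothetical usage sketch:

```typescript
type Issue = { field: string; severity: 'error' | 'warning'; message: string };

// Split issues so a build can fail on errors but merely log warnings.
function summarize(issues: Issue[]): { errors: Issue[]; warnings: Issue[] } {
  return {
    errors: issues.filter((i) => i.severity === 'error'),
    warnings: issues.filter((i) => i.severity === 'warning'),
  };
}

const { errors, warnings } = summarize([
  { field: 'description', severity: 'warning', message: 'Description is too short' },
  { field: 'title', severity: 'error', message: 'Missing title' },
]);
console.log(errors.length, warnings.length); // 1 1
```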

Content Extraction

Strips navigation, ads, sidebars — returns clean text:

import { extractContent } from 'web-meta-scraper';

const content = await extractContent('https://example.com/article');
console.log(content.content);   // "Article body text..."
console.log(content.wordCount); // 1234
console.log(content.language);  // "en"
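One practical use of that wordCount field: a quick reading-time estimate, assuming an average pace of ~200 words per minute (the field name follows the example above):

```typescript
// Estimate reading time from an extracted word count (~200 wpm by default).
function readingTimeMinutes(wordCount: number, wpm = 200): number {
  return Math.max(1, Math.ceil(wordCount / wpm));
}

console.log(readingTimeMinutes(1234)); // 7
```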

Batch Scraping

import { batchScrape } from 'web-meta-scraper';

const results = await batchScrape(
  ['https://example.com', 'https://github.com', 'https://nodejs.org'],
  { concurrency: 3 },
);
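A concurrency cap like this is typically built as a small worker pool: N workers pull from a shared index so at most N requests are in flight. A generic sketch of the pattern, with a stand-in async task instead of real scraping (purely illustrative, not the library's internals):

```typescript
// Map over items with at most `limit` tasks in flight, preserving input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index before awaiting
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

const doubled = await mapWithConcurrency([1, 2, 3, 4], 2, async (n) => n * 2);
console.log(doubled); // [ 2, 4, 6, 8 ]
```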

MCP Server — AI Assistants Can Use It

This is the part I'm most excited about.

web-meta-scraper-mcp exposes all scraping functionality as MCP (Model Context Protocol) tools. AI assistants like Claude can directly scrape, validate, and analyze web metadata.

Setup (one command)

Claude Code:

claude mcp add web-meta-scraper -- npx -y web-meta-scraper-mcp

Claude Desktop / Cursor:

{
  "mcpServers": {
    "web-meta-scraper": {
      "command": "npx",
      "args": ["-y", "web-meta-scraper-mcp"]
    }
  }
}

7 Tools Available

| Tool | Description |
| --- | --- |
| scrape_url | Extract metadata from a URL |
| scrape_html | Extract metadata from raw HTML |
| batch_scrape | Scrape multiple URLs concurrently |
| detect_feeds | Detect RSS/Atom feeds |
| check_robots | Check robots directives |
| validate_metadata | SEO score report |
| extract_content | Extract main text content |

Once connected, you can ask Claude things like:

  • "Scrape this URL and summarize the metadata"
  • "Check if this page is SEO-friendly"
  • "Extract the main content from this article"
  • "Find all RSS feeds on this blog"

The MCP server handles everything — Claude just calls the tools and gets structured results back.

Try It

Live Playground: https://radiant-malabi-26e1e6.netlify.app/playground

Enter any URL, toggle stealth mode, and see extracted metadata, SEO validation, and content extraction in action.

Links

If you find it useful, a star on GitHub would mean a lot. Feedback and contributions are always welcome!
