Mingi Choe

I built a plugin-based web metadata scraper with MCP server support

TL;DR

I built web-meta-scraper — a lightweight TypeScript library that extracts metadata from web pages. 14 built-in plugins, 1 dependency, ~5KB. It also ships as an MCP server so AI assistants like Claude can use it as a tool.


Why I built this

I needed to extract Open Graph, Twitter Cards, and JSON-LD from URLs for a link preview feature. Existing solutions didn't fit: metascraper pulls in 10+ dependencies (~50KB), and open-graph-scraper has no plugin system and little room for customization. Both felt wrong for what should be a simple task.

So I built one with a few goals:

  • Minimal dependencies — only cheerio for HTML parsing, native fetch() for HTTP
  • Composable — pick only the plugins you need
  • Type-safe — full TypeScript definitions for everything
  • Extensible — custom plugins are just functions

Quick Start

npm install web-meta-scraper
import { scrape } from 'web-meta-scraper';

const result = await scrape('https://github.com');

console.log(result.metadata);
// {
//   title: "GitHub",
//   description: "Build and ship software on a single...",
//   image: "https://github.githubassets.com/assets/...",
//   ...
// }

That's it. One function, everything extracted and merged.

Plugin Architecture

The core idea: each metadata source is a separate plugin. Use all of them or just the ones you need.

import { createScraper, openGraph, twitter, video, audio } from 'web-meta-scraper';

const scraper = createScraper({
  plugins: [openGraph, twitter, video, audio],
});

14 Built-in Plugins

| Plugin | What it extracts |
| --- | --- |
| metaTags | title, description, keywords, author, favicon, canonical URL |
| openGraph | og:title, og:image, og:type, og:site_name, etc. |
| twitter | twitter:card, twitter:image, twitter:creator, etc. |
| jsonLd | Schema.org structured data (Article, Product, FAQ, etc.) |
| oembed | oEmbed data with async endpoint fetching |
| favicons | All icon variants (apple-touch-icon, mask-icon, manifest) |
| feeds | RSS and Atom feed links |
| robots | Robots directives, indexability flags |
| date | Publication and modification dates |
| logo | Site logo from OG, Schema.org, JSON-LD |
| lang | Document language (BCP 47) |
| video | Video resources from OG, twitter:player, `<video>`, JSON-LD |
| audio | Audio resources from OG, `<audio>`, JSON-LD |
| iframe | Embeddable iframe HTML |

Custom Plugins

A plugin is just a function:

import type { Plugin } from 'web-meta-scraper';

const pricePlugin: Plugin = (ctx) => {
  const price = ctx.$('[itemprop="price"]').attr('content');
  return {
    name: 'price',
    data: { price },
  };
};
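Because a plugin is just a function, you can unit-test it with a stubbed context and no network at all. The types below are local stand-ins inferred from the example above, not the library's actual exports:

```typescript
// Minimal stand-ins for the library's types, inferred from the example above.
type PluginResult = { name: string; data: Record<string, unknown> };
type Ctx = { $: (selector: string) => { attr: (name: string) => string | undefined } };
type Plugin = (ctx: Ctx) => PluginResult;

const pricePlugin: Plugin = (ctx) => {
  const price = ctx.$('[itemprop="price"]').attr('content');
  return { name: 'price', data: { price } };
};

// Stub a cheerio-like context that returns a fixed price for the selector.
const fakeCtx: Ctx = {
  $: (selector) => ({
    attr: (name) =>
      selector === '[itemprop="price"]' && name === 'content' ? '19.99' : undefined,
  }),
};

const result = pricePlugin(fakeCtx);
console.log(result); // { name: 'price', data: { price: '19.99' } }
```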

Priority-Based Merging

When the same field exists in multiple sources, the highest-priority value wins automatically. And you can customize the rules:

const scraper = createScraper({
  plugins: [metaTags, openGraph, twitter],
  rules: [
    {
      field: 'title',
      sources: [
        { plugin: 'twitter', key: 'title', priority: 3 },   // Twitter first
        { plugin: 'open-graph', key: 'title', priority: 2 },
        { plugin: 'meta-tags', key: 'title', priority: 1 },
      ],
    },
  ],
});
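Conceptually, the merge step walks each field's sources from highest to lowest priority and takes the first value a plugin actually produced. A rough sketch of that logic (an illustration, not the library's internals):

```typescript
type Source = { plugin: string; key: string; priority: number };

// Pick the highest-priority value that a plugin actually produced.
function mergeField(
  sources: Source[],
  pluginData: Record<string, Record<string, string | undefined>>,
): string | undefined {
  const ranked = [...sources].sort((a, b) => b.priority - a.priority);
  for (const { plugin, key } of ranked) {
    const value = pluginData[plugin]?.[key];
    if (value !== undefined) return value;
  }
  return undefined;
}

// twitter has priority 3 but produced no title, so open-graph (priority 2) wins.
const title = mergeField(
  [
    { plugin: 'twitter', key: 'title', priority: 3 },
    { plugin: 'open-graph', key: 'title', priority: 2 },
    { plugin: 'meta-tags', key: 'title', priority: 1 },
  ],
  {
    'open-graph': { title: 'OG Title' },
    'meta-tags': { title: 'Meta Title' },
  },
);
console.log(title); // "OG Title"
```

This is also why a high-priority source never blocks a result: if it has no value for the field, the merge simply falls through to the next source.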

Beyond Scraping

SEO Validation

import { scrape, validateMetadata } from 'web-meta-scraper';

const result = await scrape('https://example.com');
const validation = validateMetadata(result);

console.log(validation.score);  // 85/100
console.log(validation.issues);
// [{ field: "description", severity: "warning", message: "Description is too short" }]
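Assuming the issue shape shown above (field, severity, message), it's straightforward to gate a CI check on hard errors while only reporting warnings. A hypothetical usage sketch:

```typescript
type Issue = { field: string; severity: 'error' | 'warning'; message: string };

// Split issues so a build can fail on errors but merely log warnings.
function summarize(issues: Issue[]): { errors: Issue[]; warnings: Issue[] } {
  return {
    errors: issues.filter((i) => i.severity === 'error'),
    warnings: issues.filter((i) => i.severity === 'warning'),
  };
}

const { errors, warnings } = summarize([
  { field: 'description', severity: 'warning', message: 'Description is too short' },
  { field: 'title', severity: 'error', message: 'Missing title' },
]);
console.log(errors.length, warnings.length); // 1 1
```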

Content Extraction

Strips navigation, ads, sidebars — returns clean text:

import { extractContent } from 'web-meta-scraper';

const content = await extractContent('https://example.com/article');
console.log(content.content);   // "Article body text..."
console.log(content.wordCount); // 1234
console.log(content.language);  // "en"
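One practical use of that wordCount field: a quick reading-time estimate, assuming an average pace of ~200 words per minute (the field name follows the example above):

```typescript
// Estimate reading time from an extracted word count (~200 wpm by default).
function readingTimeMinutes(wordCount: number, wpm = 200): number {
  return Math.max(1, Math.ceil(wordCount / wpm));
}

console.log(readingTimeMinutes(1234)); // 7
```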

Batch Scraping

import { batchScrape } from 'web-meta-scraper';

const results = await batchScrape(
  ['https://example.com', 'https://github.com', 'https://nodejs.org'],
  { concurrency: 3 },
);
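A concurrency cap like this is typically built as a small worker pool: N workers pull from a shared index so at most N requests are in flight. A generic sketch of the pattern, with a stand-in async task instead of real scraping (purely illustrative, not the library's internals):

```typescript
// Map over items with at most `limit` tasks in flight, preserving input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index before awaiting
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

const doubled = await mapWithConcurrency([1, 2, 3, 4], 2, async (n) => n * 2);
console.log(doubled); // [ 2, 4, 6, 8 ]
```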

MCP Server — AI Assistants Can Use It

This is the part I'm most excited about.

web-meta-scraper-mcp exposes all scraping functionality as MCP (Model Context Protocol) tools. AI assistants like Claude can directly scrape, validate, and analyze web metadata.

Setup (one command)

Claude Code:

claude mcp add web-meta-scraper -- npx -y web-meta-scraper-mcp

Claude Desktop / Cursor:

{
  "mcpServers": {
    "web-meta-scraper": {
      "command": "npx",
      "args": ["-y", "web-meta-scraper-mcp"]
    }
  }
}

7 Tools Available

| Tool | Description |
| --- | --- |
| scrape_url | Extract metadata from a URL |
| scrape_html | Extract metadata from raw HTML |
| batch_scrape | Scrape multiple URLs concurrently |
| detect_feeds | Detect RSS/Atom feeds |
| check_robots | Check robots directives |
| validate_metadata | SEO score report |
| extract_content | Extract main text content |

Once connected, you can ask Claude things like:

  • "Scrape this URL and summarize the metadata"
  • "Check if this page is SEO-friendly"
  • "Extract the main content from this article"
  • "Find all RSS feeds on this blog"

The MCP server handles everything — Claude just calls the tools and gets structured results back.

Try It

Live Playground: https://radiant-malabi-26e1e6.netlify.app/playground

Enter any URL, toggle stealth mode, and see extracted metadata, SEO validation, and content extraction in action.

Links

If you find it useful, a star on GitHub would mean a lot. Feedback and contributions are always welcome!
