Every time you paste a link into Slack, Twitter, or LinkedIn, a small miracle happens behind the scenes. The platform fetches the page, parses its HTML, pulls out the title, description, and preview image, and renders a rich card -- all in under a second. But what happens when the card shows the wrong image? Or no description at all? Or when you need to audit hundreds of pages for SEO compliance?
That is the problem urlmeta-cli solves. One command, one URL, and you get everything: the page title, meta description, Open Graph tags, Twitter Card data, Schema.org markup, content statistics, and an SEO score with actionable recommendations. In this article, I will walk through why URL metadata matters, what the tool extracts, how its SEO scoring works, and how to use batch processing and JSON output to build metadata extraction into your workflows.
Why URL Metadata Matters
Metadata is the first impression your page makes when it never gets a visit. A link shared on social media, embedded in a Slack message, or indexed by a search engine is judged entirely by its metadata. Three areas make this critical:
Link previews. When someone shares your URL, platforms like Twitter, Facebook, LinkedIn, and Discord read your Open Graph and Twitter Card tags to generate a preview card. If og:title is missing, the platform guesses -- often badly. If og:image is absent, the card renders with a generic placeholder. A single missing tag can mean the difference between a click and a scroll-past.
SEO. Search engines use your <title>, <meta name="description">, canonical URL, heading structure, and language attributes to understand and rank your page. A title that is 90 characters long gets truncated in search results. A missing canonical URL can cause duplicate content issues. A page with three <h1> tags confuses crawlers about what the page is actually about.
Content auditing. If you manage a blog, documentation site, or marketing page, you need to verify that every page has the right metadata before it goes live. Doing this manually across dozens or hundreds of pages is not realistic. You need a tool that can check them all at once and flag what is broken.
What urlmeta-cli Extracts
Install it globally and point it at any URL:
npm install -g urlmeta-cli
urlmeta https://github.com
The output is a structured report that covers six categories of metadata.
HTML Meta Tags
The tool parses the fundamental meta tags that every page should have: <title>, <meta name="description">, canonical URL (<link rel="canonical">), language (<html lang="...">), author, published date, modified date, favicon, charset, viewport, robots directive, generator, and theme color. These are the baseline tags that search engines and browsers rely on.
The extraction logic handles common inconsistencies. Meta tags can use name or property attributes. Some sites capitalize tag names. The tool checks both meta[name="description"] and meta[name="Description"], so it works even with non-standard markup.
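The spirit of that fallback can be sketched in a few lines of Node. This is an illustrative regex-based lookup, not the tool's actual source (which may well use a proper HTML parser); it accepts either the name or property attribute, in either position, case-insensitively:

```javascript
// Illustrative sketch, not the tool's actual source: a case-tolerant
// meta lookup that accepts either the name or property attribute.
function getMeta(html, key) {
  // Try name/property before content, then the reverse attribute order.
  const patterns = [
    new RegExp(`<meta[^>]+(?:name|property)=["']${key}["'][^>]*content=["']([^"']*)["']`, 'i'),
    new RegExp(`<meta[^>]+content=["']([^"']*)["'][^>]*(?:name|property)=["']${key}["']`, 'i'),
  ];
  for (const re of patterns) {
    const match = html.match(re);
    if (match) return match[1];
  }
  return null;
}

console.log(getMeta('<meta name="Description" content="A demo page.">', 'description'));
// → A demo page.
```

The 'i' flag is what makes `Description`, `description`, and `DESCRIPTION` all resolve to the same tag.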
Open Graph Protocol
Open Graph tags control how your page appears when shared on Facebook, LinkedIn, Discord, and most other platforms. The tool extracts seven OG properties:
- og:title -- the title shown in the preview card
- og:description -- the snippet below the title
- og:image -- the preview image (resolved to an absolute URL)
- og:type -- article, website, product, etc.
- og:site_name -- the name of your site
- og:url -- the canonical URL for the shared content
- og:locale -- the language/region of the content
Image URLs are resolved relative to the page URL, so even relative paths like /images/og.png are converted to their full absolute form in the output.
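Node's built-in WHATWG URL class does exactly this kind of resolution, and is presumably what any Node tool would reach for here:

```javascript
// Node's built-in WHATWG URL class resolves a relative path against a
// base URL -- the same behavior described above for og:image values.
const pageUrl = 'https://example.com/blog/post/';
const absolute = new URL('/images/og.png', pageUrl).href;
console.log(absolute); // → https://example.com/images/og.png
```

Root-relative paths (`/images/og.png`) resolve against the origin, while bare relative paths (`og.png`) resolve against the page's directory.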
Twitter Cards
Twitter (and Bluesky, Mastodon, and others) use their own set of meta tags for card rendering. The tool extracts:
- twitter:card -- the card type (summary, summary_large_image, etc.)
- twitter:title and twitter:description -- override OG tags specifically for Twitter
- twitter:image -- can differ from the OG image
- twitter:site and twitter:creator -- the @handles associated with the content
The extraction handles both name and property attribute variants, since different CMS platforms generate these differently.
Schema.org / JSON-LD
Many modern sites embed structured data using JSON-LD <script> blocks. The tool parses these to extract the @type, name, and description fields. This gives you a quick read on whether a page has rich snippet potential for Google search results.
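A minimal sketch of that step, assuming a regex scan for the script blocks rather than whatever parser the tool actually uses:

```javascript
// Illustrative sketch of the JSON-LD step: find each ld+json script
// block, parse its body, and keep the three fields of interest.
function extractJsonLd(html) {
  const results = [];
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let m;
  while ((m = re.exec(html)) !== null) {
    try {
      const data = JSON.parse(m[1]);
      results.push({ type: data['@type'], name: data.name, description: data.description });
    } catch {
      // Skip malformed JSON-LD instead of failing the whole page.
    }
  }
  return results;
}
```

The try/catch matters in practice: malformed JSON-LD is common in the wild, and one broken block should not sink the rest of the report.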
Content Statistics
Beyond metadata, the tool analyzes the actual page content:
- Word count -- calculated after stripping scripts, styles, navigation, and footer elements
- H1 tag -- the primary heading text and total count
- H2 count -- secondary headings for content structure
- Image count -- total <img> elements
- Link count -- total <a> elements with href attributes
These numbers tell you whether a page has thin content, poor heading structure, or an unusually high link density.
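A rough version of the word-count step might look like this; the real stripping rules may differ, but the shape is the same: remove non-content blocks, then remove tags, then count tokens.

```javascript
// Rough sketch of the word-count step (the tool's actual stripping
// rules may differ): drop non-content blocks, drop remaining tags,
// then count whitespace-separated tokens.
function wordCount(html) {
  const text = html
    .replace(/<(script|style|nav|footer)\b[\s\S]*?<\/\1>/gi, ' ') // non-content blocks
    .replace(/<[^>]+>/g, ' ')                                     // remaining tags
    .replace(/\s+/g, ' ')
    .trim();
  return text ? text.split(' ').length : 0;
}

console.log(wordCount('<p>Hello wide world</p><script>ignored()</script>')); // → 3
```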
Technical Details
The tool also reports server-side information: HTTP status code, response time in milliseconds, Content-Type header, Content-Length (formatted as human-readable bytes), and the charset encoding. A page that takes 4 seconds to respond has an SEO problem regardless of how good its tags are.
The SEO Score Calculator
This is where urlmeta-cli goes beyond simple extraction. After gathering all metadata, the tool computes an SEO score from 0 to 100, weighted across ten factors:
| Factor | Points | Criteria |
|---|---|---|
| Title | 15 | Present (10) + ideal length 30-60 chars (5) |
| Description | 15 | Present (10) + ideal length 120-160 chars (5) |
| Open Graph | 20 | og:title (5) + og:description (5) + og:image (7) + og:type (3) |
| Twitter Card | 10 | twitter:card (4) + twitter:title (3) + twitter:image (3) |
| Canonical URL | 10 | Present |
| H1 Heading | 10 | Exactly one (10), more than one (3) |
| Language | 5 | lang attribute present |
| Favicon | 5 | Detected via link tags or /favicon.ico |
| Viewport | 5 | viewport meta tag present |
| Performance | 5 | Under 1s (5), under 3s (3), under 5s (1) |
The score maps to a letter grade: 90+ is A+, 80+ is A, 70+ is B, 60+ is C, 50+ is D, and below 50 is F.
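The title row of the rubric and the grade mapping are simple enough to restate as code; this is a transcription of the table above, not the tool's source:

```javascript
// The Title row of the rubric: present (10) + ideal length 30-60 (5).
function titleScore(title) {
  if (!title) return 0;
  let points = 10;
  if (title.length >= 30 && title.length <= 60) points += 5;
  return points;
}

// The score-to-grade mapping described in the article.
function grade(score) {
  if (score >= 90) return 'A+';
  if (score >= 80) return 'A';
  if (score >= 70) return 'B';
  if (score >= 60) return 'C';
  if (score >= 50) return 'D';
  return 'F';
}

console.log(grade(87)); // → A
```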
More importantly, the tool lists specific issues with actionable detail. Instead of just saying "title problem," it tells you: Title too long (73 chars, recommended: 30-60). Instead of a vague "social tags missing," you get: Missing og:image -- link previews will have no image.
This makes the tool practical for both quick spot-checks and systematic auditing.
Batch Processing Multiple URLs
Real-world metadata work almost never involves a single page. You need to audit your entire blog, check a set of landing pages, or compare competitors. Pass multiple URLs and the tool processes them sequentially:
urlmeta https://github.com https://npmjs.com https://dev.to
Each URL gets a full detailed report. At the end, a summary table shows all results side by side:
Batch Summary
────────────────────────────────────────────────────────────────────────
URL Title SEO Time Status
────────────────────────────────────────────────────────────────────────
https://github.com GitHub: Let's build... 87 376ms 200
https://npmjs.com npm 72 980ms 200
https://dev.to DEV Community 91 412ms 200
────────────────────────────────────────────────────────────────────────
If you only want the summary table without the individual detailed reports, use the --summary flag:
urlmeta https://github.com https://npmjs.com https://dev.to --summary
Failed URLs (DNS failures, timeouts, server errors) are shown inline with the error message rather than crashing the entire batch. This is important when you are processing a list and cannot guarantee every URL is reachable.
You can combine batch mode with a file of URLs using shell expansion:
urlmeta $(cat urls.txt)
JSON Output for Scripting
The --json flag outputs all metadata as structured JSON, making the tool composable with other command-line utilities:
urlmeta https://github.com --json
This returns a single JSON object with every field. For multiple URLs, it returns a JSON array. Pipe it to jq for extraction:
# Get just the SEO-critical fields
urlmeta https://example.com --json | jq '{title, description, ogImage, ogTitle}'
# Check if og:image exists
urlmeta https://example.com --json | jq '.ogImage // "MISSING"'
# Batch extract all titles
urlmeta https://github.com https://npmjs.com --json | jq '.[].meta.title'
The --compact flag removes indentation for smaller payloads, useful when piping to another process or storing in a database:
urlmeta https://example.com --json --compact >> metadata-log.jsonl
CI/CD Integration
JSON output makes it straightforward to add metadata checks to your deployment pipeline. A simple approach:
#!/bin/bash
STATUS=$(urlmeta https://your-staging-site.com --json | jq -r '.statusCode')
if [ "$STATUS" != "200" ]; then
  echo "Staging site returned non-200 status"
  exit 1
fi
Or check that all required OG tags are present before merging a PR that changes your landing page.
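One way to express that gate as a small Node check. The field names here (ogTitle, ogDescription, ogImage) follow the jq examples earlier in the article; verify them against your own --json output before relying on this:

```javascript
// Field names (ogTitle, ogDescription, ogImage) are assumed from the
// jq examples above -- confirm against your actual --json output.
function missingOgTags(report) {
  const required = ['ogTitle', 'ogDescription', 'ogImage'];
  return required.filter((key) => !report[key]);
}

// Example report with a missing image:
const report = { ogTitle: 'My page', ogDescription: 'What it does', ogImage: null };
console.log(missingOgTags(report)); // → [ 'ogImage' ]
```

In CI you would pipe `urlmeta <url> --json` into a script built around this check and fail the build when the returned list is non-empty.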
Additional Options
A few more flags worth knowing:
- --timeout <ms> -- set a custom request timeout (default is 10 seconds). Useful for slow sites or tight CI budgets.
- --user-agent <string> -- send a custom User-Agent header. Some sites serve different content to bots versus browsers; this lets you test both scenarios.
Getting Started
npm install -g urlmeta-cli
urlmeta https://your-site.com
That is it. One command, a full metadata report, and an SEO score with specific issues to fix. For batch work, pass multiple URLs. For automation, add --json. The tool has no configuration files, no API keys, and no dependencies beyond Node.js.
The source is MIT-licensed and available on GitHub. If URL metadata is part of your workflow -- whether for link previews, SEO, content auditing, or competitive analysis -- give it a try.
More CLI tools from the same author:
- urlmeta-cli -- extract metadata, Open Graph, Twitter Cards, and SEO score from any URL
- websnap-reader -- capture and convert web pages into clean, readable Markdown
- ghbounty -- find open-source bounties on GitHub issues
- devpitch -- generate professional pitch decks for developer tools
- pricemon -- monitor product prices and get alerts on drops
- repo-readme-gen -- auto-generate polished README files from repository contents