Every app I've built in the last few years has needed the same thing: paste a URL, show a preview card. Slack does it. Discord does it. Every CMS does it. And every time, I end up writing the same cheerio scraping code, handling the same edge cases with Open Graph tags, and debugging the same issue where Twitter Cards use `name` instead of `property` and half the internet gets it wrong.
A little while ago I finally extracted all of that into a standalone API and put it on RapidAPI. Figured other people are writing the same code too.
## What it actually returns
You pass a URL. You get back structured metadata across six layers:
- Open Graph (title, description, images with dimensions, article metadata)
- Twitter Cards (card type, site, creator; only tags that are actually present, no fallback guessing)
- HTML meta (title tag, meta description, canonical, theme color)
- Icons (auto-selects the highest quality favicon with a priority chain)
- Feeds (discovers RSS, Atom, and JSON Feed links)
- JSON-LD (parses all script blocks, prefers Article/Product over BreadcrumbList)
The response gives you both a merged top-level view (where the title comes from whichever source has it; OG first, then Twitter, then the title tag) and the raw parsed layers, so you can do your own logic.
Here's what GitHub returns:
```json
{
  "title": "GitHub · Build and ship software on a single, collaborative platform",
  "description": "Join the world's most widely adopted...",
  "image": {
    "url": "https://github.githubassets.com/images/modules/site/social-cards/campaign-social.png",
    "width": 1200,
    "height": 630
  },
  "favicon": "https://github.githubassets.com/favicons/favicon.svg",
  "siteName": "GitHub",
  "type": "website",
  "themeColor": "#1e2327",
  "openGraph": { ... },
  "twitter": { ... },
  "feeds": [],
  "jsonLd": { "@type": "WebSite", ... },
  "responseTime": 234
}
```
## The parts that were annoying to get right
A few things I ran into while building this that I suspect anyone writing their own scraper will hit too:
**OG tags use `property`, Twitter uses `name`.** The Open Graph spec says `<meta property="og:title">`. Twitter Cards say `<meta name="twitter:card">`. But a surprising number of sites swap them. The parser checks both attributes for both prefixes.
**Multiple `og:image` tags are valid.** The OG spec supports arrays by repeating the tag. Structured properties like `og:image:width` apply to the most recently declared `og:image`. Most scrapers just grab the first one and ignore the rest.
**JSON-LD blocks are a mess.** A typical news article page has three or four JSON-LD blocks: a `BreadcrumbList`, an `Organization`, and then the actual `Article` buried in the third one. You need to parse all of them and pick the right one.
**Favicons have a priority order.** Apple touch icons at 180×180 are usually the highest quality, then standard icons at 32×32, then the generic `/favicon.ico` fallback. Most implementations just grab the first `<link rel="icon">` they find.
**Relative URLs everywhere.** OG images and feed links are often relative paths. You need the effective URL (after redirects) as the base to resolve them correctly.
## The technical approach
It's a Fastify server running on a VPS, using cheerio for HTML parsing. No headless browser, no Puppeteer, just fetch the HTML and parse it. This keeps response times under 500ms for cache misses and under 5ms for cache hits.
SSRF protection was the part I spent the most time on. Since the API accepts arbitrary URLs from users, you need to prevent people from using it to probe internal networks. The server resolves the hostname first, checks the IP against a blocklist of private ranges, then connects directly to the resolved IP to prevent DNS rebinding attacks.
I also built a text analytics API while I was at it, using the same infrastructure but for a different purpose. Pass in text, get back readability scores (Flesch-Kincaid, Coleman-Liau, SMOG, and three others), keyword density, bigrams, trigrams, and reading time. All pure math on strings, sub-10ms responses. Useful for content optimization tools and writing assistants.
## Try them
Both APIs are on RapidAPI with a free tier (500 requests/month):
- LinkPreview – URL metadata extraction
- TextAnalytics – readability scores, keyword density, text metrics
The free tier is enough to test and prototype with.
If you run into any edge cases the parser doesn't handle well, I'd genuinely like to know. Parsing the wild HTML of the internet is a forever project.