I've spent the last few months building a system that extracts meta tags from URLs at scale. Along the way I hit every wall you can imagine — rate limits, CAPTCHAs, bot detection, encoding nightmares, and HTML so malformed it would make a parser cry.
Here's everything I learned, so you don't have to learn it the hard way.
The Simple Version (That Breaks Immediately)
Extracting meta tags seems trivial:
const res = await fetch(url);
const html = await res.text();
const title = html.match(/<title>(.*?)<\/title>/)?.[1];
This works for about 60% of websites. The other 40% will teach you humility.
Problem 1: Bot Detection
Many sites block requests that don't look like a real browser.
What Gets You Blocked
- Missing or generic User-Agent header
- No Accept, Accept-Language, or Accept-Encoding headers
- Requesting from cloud provider IP ranges (AWS, GCP, Azure)
- Making too many requests too fast
- Missing TLS fingerprint characteristics
What Works
Set headers that look like a real browser:
const response = await fetch(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0; +https://mysite.com/bot)',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
},
redirect: 'follow',
});
Notice I'm identifying as a bot with a contact URL in the User-Agent. This is ethical best practice — you're being transparent about what you are. Many sites have specific bot policies and will treat identified bots better than anonymous scrapers.
Rate Limiting Yourself
Even if a site doesn't block you, be respectful:
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
async function fetchWithRateLimit(urls, delayMs = 1000) {
const results = [];
for (const url of urls) {
results.push(await fetchMetaTags(url));
await delay(delayMs);
}
return results;
}
Problem 2: Redirect Chains
A simple URL can lead you on a wild chase:
https://t.co/abc123
→ https://bit.ly/xyz789
→ https://example.com/blog?utm_source=twitter
→ https://example.com/blog/my-post
→ https://www.example.com/blog/my-post (www redirect)
→ https://www.example.com/blog/my-post/ (trailing slash)
That's 5 redirects just to reach the actual page.
How to Handle It
async function fetchWithRedirects(url, maxRedirects = 10) {
let currentUrl = url;
let redirectCount = 0;
while (redirectCount < maxRedirects) {
const response = await fetch(currentUrl, { redirect: 'manual' });
if (response.status >= 300 && response.status < 400) {
const location = response.headers.get('location');
if (!location) return { response, finalUrl: currentUrl, redirectCount }; // 3xx without a Location header isn't a redirect we can follow
// Handle relative redirects
currentUrl = new URL(location, currentUrl).href;
redirectCount++;
continue;
}
return { response, finalUrl: currentUrl, redirectCount };
}
throw new Error(`Too many redirects (${maxRedirects})`);
}
Why redirect: 'manual'? Because you want to track the final URL. The og:url tag often differs from the URL you started with, and you need the final URL to resolve relative image paths.
Problem 3: Character Encoding
The web is not all UTF-8. You'll encounter:
- Shift-JIS (Japanese sites)
- EUC-KR (Korean sites)
- GB2312 / GBK (Chinese sites)
- ISO-8859-1 (older European sites)
- Windows-1252 (legacy sites that claim ISO-8859-1 but use Windows-1252)
Detection Strategy
function detectEncoding(html, headers) {
// 1. Check HTTP Content-Type header
const contentType = headers.get('content-type') || '';
const charsetMatch = contentType.match(/charset=([\w-]+)/i);
if (charsetMatch) return charsetMatch[1];
// 2. Check HTML meta tag
const metaMatch = html.match(
/<meta[^>]+charset=["']?([\w-]+)/i
);
if (metaMatch) return metaMatch[1];
// 3. Check XML declaration
const xmlMatch = html.match(
/<\?xml[^>]+encoding=["']([\w-]+)/i
);
if (xmlMatch) return xmlMatch[1];
// 4. Default to UTF-8
return 'utf-8';
}
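One subtlety the function above glosses over: you need raw bytes to detect the charset, but a decoded string to run the regexes. The usual trick is a two-pass decode — tentatively decode as UTF-8 to sniff the charset, then re-decode the same bytes with whatever you found. A minimal sketch using the standard TextDecoder API (sniffCharset and decodeHtml are my names, and sniffCharset is a condensed version of the detection above):

```javascript
// Condensed charset sniffing: HTTP Content-Type header first, then the HTML meta tag.
function sniffCharset(html, contentType) {
  const headerMatch = (contentType || '').match(/charset=([\w-]+)/i);
  if (headerMatch) return headerMatch[1];
  const metaMatch = html.match(/<meta[^>]+charset=["']?([\w-]+)/i);
  return metaMatch ? metaMatch[1] : 'utf-8';
}

// Two-pass decode: tentatively decode the raw bytes as UTF-8 to sniff
// the charset, then re-decode the original bytes with the detected encoding.
function decodeHtml(bytes, contentType = '') {
  const tentative = new TextDecoder('utf-8', { fatal: false }).decode(bytes);
  const encoding = sniffCharset(tentative, contentType);
  try {
    return new TextDecoder(encoding).decode(bytes);
  } catch {
    return tentative; // unrecognized encoding label: keep the UTF-8 decode
  }
}
```

With a fetch, you'd call it as decodeHtml(new Uint8Array(await response.arrayBuffer()), response.headers.get('content-type')).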
Problem 4: Malformed HTML
Real-world HTML is a horror show:
<!-- Unclosed tags -->
<meta property="og:title" content="Hello World>
<!-- Wrong quotes -->
<meta property='og:image' content=https://example.com/img.png>
<!-- Mixed case -->
<META PROPERTY="OG:TITLE" CONTENT="Hello">
<!-- Duplicate tags -->
<meta property="og:title" content="Title 1">
<meta property="og:title" content="Title 2">
How to Handle It
Don't use regex for the final parse. Use a proper HTML parser that handles malformed markup:
Node.js:
import { parse } from 'node-html-parser';
const root = parse(html);
const ogTitle = root.querySelector('meta[property="og:title"]')?.getAttribute('content');
Python:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
og_title = soup.find('meta', property='og:title')
title = og_title['content'] if og_title else None
For duplicates, always take the first occurrence — that's what most platforms do.
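If you're walking all meta tags yourself rather than relying on querySelector (which already returns the first match), first-wins is easy to enforce with a Map. A small sketch (collectMeta and the [property, content] pair format are illustrative, not from a library):

```javascript
// Keep only the first occurrence of each meta property,
// matching the first-wins behavior of most platforms.
function collectMeta(pairs) {
  const seen = new Map();
  for (const [property, content] of pairs) {
    if (!seen.has(property)) seen.set(property, content);
  }
  return seen;
}
```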
Problem 5: JavaScript-Rendered Content
Some sites render their OG tags with JavaScript. When you fetch the HTML, the <head> is empty or contains placeholder values.
This is increasingly common with SPAs (React, Vue, Angular).
Detection
If the HTML contains very little content but has large JS bundles, the site is likely JS-rendered:
function isLikelyJSRendered(html) {
const hasMinimalBody = html.match(/<body[^>]*>[\s\S]{0,500}<\/body>/i);
const hasReactRoot = html.includes('id="root"') || html.includes('id="app"') || html.includes('id="__next"');
const hasNoOGTags = !html.includes('og:title');
return (hasMinimalBody || hasReactRoot) && hasNoOGTags;
}
Solutions
- Check for server-side rendered alternatives — many SPAs serve different HTML to bots based on User-Agent
- Use a headless browser — Puppeteer or Playwright can execute JS, but it's 10-100x slower
- Accept the limitation — if the site doesn't serve OG tags to bots, there's no metadata to extract
Problem 6: Security (SSRF)
If your scraper accepts user-provided URLs, you MUST prevent Server-Side Request Forgery:
function validateUrl(url) {
const parsed = new URL(url);
// Only allow HTTP(S)
if (!['http:', 'https:'].includes(parsed.protocol)) {
throw new Error('Only HTTP(S) URLs allowed');
}
// Block private/internal IPs
const hostname = parsed.hostname;
if (
hostname === 'localhost' ||
hostname === '127.0.0.1' ||
hostname === '[::1]' ||
hostname.startsWith('10.') ||
hostname.startsWith('192.168.') ||
/^172\.(1[6-9]|2\d|3[01])\./.test(hostname) || // only 172.16.0.0/12 is private
hostname.endsWith('.local') ||
hostname.endsWith('.internal')
) {
throw new Error('Private URLs not allowed');
}
return parsed;
}
Also set a reasonable timeout (5-15 seconds) and a response size limit (1-5MB) to prevent resource exhaustion. Note that hostname checks alone won't stop DNS rebinding attacks; for stronger protection, resolve the hostname and validate the resulting IP before connecting.
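The size limit is best enforced while streaming, so you never buffer an oversized response. A sketch that reads any WHATWG ReadableStream (e.g. response.body) up to a cap — readLimited is my name for it:

```javascript
// Read a ReadableStream up to maxBytes, cancelling the stream
// and throwing if the body turns out to be larger than allowed.
async function readLimited(stream, maxBytes) {
  const reader = stream.getReader();
  const chunks = [];
  let total = 0;
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    total += value.byteLength;
    if (total > maxBytes) {
      await reader.cancel();
      throw new Error(`Response exceeded ${maxBytes} bytes`);
    }
    chunks.push(value);
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

For the timeout half, pass AbortSignal.timeout(10_000) as the signal option to fetch.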
Problem 7: Favicon Discovery
Finding a site's favicon is surprisingly hard. There's no single standard location. You need to check, in order:
function findFavicon(html, baseUrl) {
const doc = parseHTML(html);
// 1. Explicit link tags (various rel values)
const selectors = [
'link[rel="icon"]',
'link[rel="shortcut icon"]',
'link[rel="apple-touch-icon"]',
'link[rel="apple-touch-icon-precomposed"]',
];
for (const selector of selectors) {
const el = doc.querySelector(selector);
if (el?.getAttribute('href')) {
return new URL(el.getAttribute('href'), baseUrl).href;
}
}
// 2. Default /favicon.ico
return new URL('/favicon.ico', baseUrl).href;
}
But even this isn't enough — some sites serve different favicons for different sizes, use SVG favicons, or have the favicon at a completely non-standard path.
The Architecture That Works
After all these lessons, here's the architecture I settled on:
1. Validate URL (SSRF protection)
2. Check cache (skip fetch if fresh)
3. Fetch with timeout (5s) and redirect following (max 10)
4. Detect encoding, decode to UTF-8
5. Parse HTML with a tolerant parser
6. Extract: OG tags → Twitter tags → HTML tags → fallbacks
7. Resolve relative URLs against final URL
8. Validate image URL (HEAD request, check content-type)
9. Find favicon
10. Cache result (1 hour TTL)
Each step has its own error handling. If any step fails, you still return whatever you managed to extract.
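Step 6's fallback chain is just a priority lookup per field. A minimal sketch, assuming you've already collected the page's meta tags into a plain object keyed by property name (pickMeta and the tag-map shape are illustrative):

```javascript
// Resolve each field through its priority chain:
// Open Graph → Twitter Card → plain HTML fallback.
function pickMeta(tags) {
  const first = (...keys) => keys.map(k => tags[k]).find(v => v != null) ?? null;
  return {
    title: first('og:title', 'twitter:title', 'title'),
    description: first('og:description', 'twitter:description', 'description'),
    image: first('og:image', 'twitter:image'),
  };
}
```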
Performance Tips
- Set Accept-Encoding: gzip — most sites support it, reducing transfer size by 60-80%
- Only read the <head> — you don't need the full page. Stop reading after </head> or after the first 50KB
- DNS caching — if you're fetching many URLs from the same domain, cache DNS lookups
- Connection pooling — reuse TCP connections for same-origin requests
- Parallel fetching with concurrency limits — fetch 10 URLs at once, not 1 or 1000
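The concurrency-limited fetching from that last tip needs no library — a small worker pool does it (mapWithConcurrency is my name for the helper):

```javascript
// Run an async function over items with at most `limit` in flight at once.
// Workers pull the next index synchronously, so no two process the same item.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

Combined with the earlier fetchMetaTags, that's const metas = await mapWithConcurrency(urls, 10, fetchMetaTags).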
Wrap Up
Meta tag extraction sounds simple. It's not. The real world has encoding issues, malformed HTML, bot detection, redirect chains, JS-rendered content, and security concerns.
But once you've handled these edge cases, you have a robust system that works on 95%+ of the web. The remaining 5% are sites that actively prevent any form of scraping — and that's their right.
What's the weirdest edge case you've hit when scraping websites? I'd love to hear your war stories in the comments.