I've spent the last few months building a system that extracts meta tags from URLs at scale. Along the way I hit every wall you can imagine — rate limits, CAPTCHAs, bot detection, encoding nightmares, and HTML so malformed it would make a parser cry.
Here's everything I learned, so you don't have to learn it the hard way.
The Simple Version (That Breaks Immediately)
Extracting meta tags seems trivial:
const res = await fetch(url);
const html = await res.text();
const title = html.match(/<title>(.*?)<\/title>/)?.[1];
This works for about 60% of websites. The other 40% will teach you humility.
Problem 1: Bot Detection
Many sites block requests that don't look like a real browser.
What Gets You Blocked
- Missing or generic User-Agent header
- No Accept, Accept-Language, or Accept-Encoding headers
- Requesting from cloud provider IP ranges (AWS, GCP, Azure)
- Making too many requests too fast
- Missing TLS fingerprint characteristics
What Works
Set headers that look like a real browser:
const response = await fetch(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0; +https://mysite.com/bot)',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
},
redirect: 'follow',
});
Notice I'm identifying as a bot with a contact URL in the User-Agent. This is ethical best practice — you're being transparent about what you are. Many sites have specific bot policies and will treat identified bots better than anonymous scrapers.
Rate Limiting Yourself
Even if a site doesn't block you, be respectful:
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
async function fetchWithRateLimit(urls, delayMs = 1000) {
const results = [];
for (const url of urls) {
results.push(await fetchMetaTags(url));
await delay(delayMs);
}
return results;
}
Problem 2: Redirect Chains
A simple URL can lead you on a wild chase:
https://t.co/abc123
→ https://bit.ly/xyz789
→ https://example.com/blog?utm_source=twitter
→ https://example.com/blog/my-post
→ https://www.example.com/blog/my-post (www redirect)
→ https://www.example.com/blog/my-post/ (trailing slash)
That's 5 redirects just to reach the actual page.
How to Handle It
async function fetchWithRedirects(url, maxRedirects = 10) {
let currentUrl = url;
let redirectCount = 0;
while (redirectCount < maxRedirects) {
const response = await fetch(currentUrl, { redirect: 'manual' });
if (response.status >= 300 && response.status < 400) {
const location = response.headers.get('location');
if (!location) return { response, finalUrl: currentUrl, redirectCount }; // 3xx without a Location header isn't a redirect we can follow
// Handle relative redirects
currentUrl = new URL(location, currentUrl).href;
redirectCount++;
continue;
}
return { response, finalUrl: currentUrl, redirectCount };
}
throw new Error(`Too many redirects (${maxRedirects})`);
}
Why redirect: 'manual'? Because you want to track the final URL. The og:url tag often differs from the URL you started with, and you need the final URL to resolve relative image paths.
Problem 3: Character Encoding
The web is not all UTF-8. You'll encounter:
- Shift-JIS (Japanese sites)
- EUC-KR (Korean sites)
- GB2312 / GBK (Chinese sites)
- ISO-8859-1 (older European sites)
- Windows-1252 (legacy sites that claim ISO-8859-1 but use Windows-1252)
Detection Strategy
function detectEncoding(html, headers) {
// 1. Check HTTP Content-Type header
const contentType = headers.get('content-type') || '';
const charsetMatch = contentType.match(/charset=([\w-]+)/i);
if (charsetMatch) return charsetMatch[1];
// 2. Check HTML meta tag
const metaMatch = html.match(
/<meta[^>]+charset=["']?([\w-]+)/i
);
if (metaMatch) return metaMatch[1];
// 3. Check XML declaration
const xmlMatch = html.match(
/<\?xml[^>]+encoding=["']([\w-]+)/i
);
if (xmlMatch) return xmlMatch[1];
// 4. Default to UTF-8
return 'utf-8';
}
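One subtlety the function above glosses over: you need raw bytes to detect the charset, but a decoded string to run the regexes. The usual trick is a two-pass decode — tentatively decode as UTF-8 to sniff the charset, then re-decode the same bytes with whatever you found. A minimal sketch using the standard TextDecoder API (sniffCharset and decodeHtml are my names, and sniffCharset is a condensed version of the detection above):

```javascript
// Condensed charset sniffing: HTTP Content-Type header first, then the HTML meta tag.
function sniffCharset(html, contentType) {
  const headerMatch = (contentType || '').match(/charset=([\w-]+)/i);
  if (headerMatch) return headerMatch[1];
  const metaMatch = html.match(/<meta[^>]+charset=["']?([\w-]+)/i);
  return metaMatch ? metaMatch[1] : 'utf-8';
}

// Two-pass decode: tentatively decode the raw bytes as UTF-8 to sniff
// the charset, then re-decode the original bytes with the detected encoding.
function decodeHtml(bytes, contentType = '') {
  const tentative = new TextDecoder('utf-8', { fatal: false }).decode(bytes);
  const encoding = sniffCharset(tentative, contentType);
  try {
    return new TextDecoder(encoding).decode(bytes);
  } catch {
    return tentative; // unrecognized encoding label: keep the UTF-8 decode
  }
}
```

With a fetch, you'd call it as decodeHtml(new Uint8Array(await response.arrayBuffer()), response.headers.get('content-type')).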
Problem 4: Malformed HTML
Real-world HTML is a horror show:
<!-- Unclosed tags -->
<meta property="og:title" content="Hello World>
<!-- Wrong quotes -->
<meta property='og:image' content=https://example.com/img.png>
<!-- Mixed case -->
<META PROPERTY="OG:TITLE" CONTENT="Hello">
<!-- Duplicate tags -->
<meta property="og:title" content="Title 1">
<meta property="og:title" content="Title 2">
How to Handle It
Don't use regex for the final parse. Use a proper HTML parser that handles malformed markup:
Node.js:
import { parse } from 'node-html-parser';
const root = parse(html);
const ogTitle = root.querySelector('meta[property="og:title"]')?.getAttribute('content');
Python:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
og_title = soup.find('meta', property='og:title')
title = og_title['content'] if og_title else None
For duplicates, always take the first occurrence — that's what most platforms do.
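If you're walking all meta tags yourself rather than relying on querySelector (which already returns the first match), first-wins is easy to enforce with a Map. A small sketch (collectMeta and the [property, content] pair format are illustrative, not from a library):

```javascript
// Keep only the first occurrence of each meta property,
// matching the first-wins behavior of most platforms.
function collectMeta(pairs) {
  const seen = new Map();
  for (const [property, content] of pairs) {
    if (!seen.has(property)) seen.set(property, content);
  }
  return seen;
}
```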
Problem 5: JavaScript-Rendered Content
Some sites render their OG tags with JavaScript. When you fetch the HTML, the <head> is empty or contains placeholder values.
This is increasingly common with SPAs (React, Vue, Angular).
Detection
If the HTML contains very little content but has large JS bundles, the site is likely JS-rendered:
function isLikelyJSRendered(html) {
const hasMinimalBody = html.match(/<body[^>]*>[\s\S]{0,500}<\/body>/i);
const hasReactRoot = html.includes('id="root"') || html.includes('id="app"') || html.includes('id="__next"');
const hasNoOGTags = !html.includes('og:title');
return (hasMinimalBody || hasReactRoot) && hasNoOGTags;
}
Solutions
- Check for server-side rendered alternatives — many SPAs serve different HTML to bots based on User-Agent
- Use a headless browser — Puppeteer or Playwright can execute JS, but it's 10-100x slower
- Accept the limitation — if the site doesn't serve OG tags to bots, there's no metadata to extract
Problem 6: Security (SSRF)
If your scraper accepts user-provided URLs, you MUST prevent Server-Side Request Forgery:
function validateUrl(url) {
const parsed = new URL(url);
// Only allow HTTP(S)
if (!['http:', 'https:'].includes(parsed.protocol)) {
throw new Error('Only HTTP(S) URLs allowed');
}
// Block private/internal IPs
const hostname = parsed.hostname;
if (
hostname === 'localhost' ||
hostname === '127.0.0.1' ||
hostname === '[::1]' ||
hostname.startsWith('10.') ||
hostname.startsWith('192.168.') ||
/^172\.(1[6-9]|2\d|3[01])\./.test(hostname) || // only 172.16.0.0/12 is private
hostname.endsWith('.local') ||
hostname.endsWith('.internal')
) {
throw new Error('Private URLs not allowed');
}
return parsed;
}
Also set a reasonable timeout (5-15 seconds) and a response size limit (1-5MB) to prevent resource exhaustion. Note that hostname checks alone won't stop DNS rebinding attacks; for stronger protection, resolve the hostname and validate the resulting IP before connecting.
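The size limit is best enforced while streaming, so you never buffer an oversized response. A sketch that reads any WHATWG ReadableStream (e.g. response.body) up to a cap — readLimited is my name for it:

```javascript
// Read a ReadableStream up to maxBytes, cancelling the stream
// and throwing if the body turns out to be larger than allowed.
async function readLimited(stream, maxBytes) {
  const reader = stream.getReader();
  const chunks = [];
  let total = 0;
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    total += value.byteLength;
    if (total > maxBytes) {
      await reader.cancel();
      throw new Error(`Response exceeded ${maxBytes} bytes`);
    }
    chunks.push(value);
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

For the timeout half, pass AbortSignal.timeout(10_000) as the signal option to fetch.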
Problem 7: Favicon Discovery
Finding a site's favicon is surprisingly hard. There's no single standard location. You need to check, in order:
function findFavicon(html, baseUrl) {
const doc = parseHTML(html);
// 1. Explicit link tags (various rel values)
const selectors = [
'link[rel="icon"]',
'link[rel="shortcut icon"]',
'link[rel="apple-touch-icon"]',
'link[rel="apple-touch-icon-precomposed"]',
];
for (const selector of selectors) {
const el = doc.querySelector(selector);
if (el?.getAttribute('href')) {
return new URL(el.getAttribute('href'), baseUrl).href;
}
}
// 2. Default /favicon.ico
return new URL('/favicon.ico', baseUrl).href;
}
But even this isn't enough — some sites serve different favicons for different sizes, use SVG favicons, or have the favicon at a completely non-standard path.
The Architecture That Works
After all these lessons, here's the architecture I settled on:
1. Validate URL (SSRF protection)
2. Check cache (skip fetch if fresh)
3. Fetch with timeout (5s) and redirect following (max 10)
4. Detect encoding, decode to UTF-8
5. Parse HTML with a tolerant parser
6. Extract: OG tags → Twitter tags → HTML tags → fallbacks
7. Resolve relative URLs against final URL
8. Validate image URL (HEAD request, check content-type)
9. Find favicon
10. Cache result (1 hour TTL)
Each step has its own error handling. If any step fails, you still return whatever you managed to extract.
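Step 6's fallback chain is just a priority lookup per field. A minimal sketch, assuming you've already collected the page's meta tags into a plain object keyed by property name (pickMeta and the tag-map shape are illustrative):

```javascript
// Resolve each field through its priority chain:
// Open Graph → Twitter Card → plain HTML fallback.
function pickMeta(tags) {
  const first = (...keys) => keys.map(k => tags[k]).find(v => v != null) ?? null;
  return {
    title: first('og:title', 'twitter:title', 'title'),
    description: first('og:description', 'twitter:description', 'description'),
    image: first('og:image', 'twitter:image'),
  };
}
```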
Performance Tips
- Set Accept-Encoding: gzip — most sites support it, reducing transfer size by 60-80%
- Only read the <head> — you don't need the full page. Stop reading after </head> or after the first 50KB
- DNS caching — if you're fetching many URLs from the same domain, cache DNS lookups
- Connection pooling — reuse TCP connections for same-origin requests
- Parallel fetching with concurrency limits — fetch 10 URLs at once, not 1 or 1000
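The concurrency-limited fetching from that last tip needs no library — a small worker pool does it (mapWithConcurrency is my name for the helper):

```javascript
// Run an async function over items with at most `limit` in flight at once.
// Workers pull the next index synchronously, so no two process the same item.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

Combined with the earlier fetchMetaTags, that's const metas = await mapWithConcurrency(urls, 10, fetchMetaTags).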
Wrap Up
Meta tag extraction sounds simple. It's not. The real world has encoding issues, malformed HTML, bot detection, redirect chains, JS-rendered content, and security concerns.
But once you've handled these edge cases, you have a robust system that works on 95%+ of the web. The remaining 5% are sites that actively prevent any form of scraping — and that's their right.
What's the weirdest edge case you've hit when scraping websites? I'd love to hear your war stories in the comments.