I Built a URL Metadata API After Wasting Days on Manual Scraping

The Problem That Wouldn't Go Away

I was building a bookmark manager (because apparently that's what developers do when they can't find the perfect one). Everything was going smoothly until I hit the "add bookmark" feature.

Users paste a URL, and I needed to show them a nice preview card with:

  • Title
  • Description
  • Image
  • Favicon

Simple, right? Wrong.

First Attempt: The BeautifulSoup Nightmare

Started with Python and BeautifulSoup. Wrote some code to fetch HTML and parse meta tags:

from bs4 import BeautifulSoup
import requests

def get_metadata(url):
    # Naive first pass: no timeout, no User-Agent, OpenGraph tags only
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    title = soup.find('meta', property='og:title')
    description = soup.find('meta', property='og:description')
    image = soup.find('meta', property='og:image')

    return {
        'title': title['content'] if title else None,
        'description': description['content'] if description else None,
        'image': image['content'] if image else None
    }

Issues I immediately ran into:

  1. JavaScript-rendered sites - Half the modern web doesn't have meta tags in the initial HTML. They're added client-side.

  2. Inconsistent meta tag formats - Some sites use og:title, others use twitter:title, some just use <title>. You need fallback logic for everything.

  3. Bot detection - Cloudflare and other services block obvious scrapers. My IP got banned testing on my own URLs.

  4. Timeouts - Some sites take 10+ seconds to respond. I had to implement aggressive timeouts and retry logic (sketched right after this list).

  5. Malformed HTML - So many sites have broken HTML that crashes parsers.
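
For reference, here's the shape that timeout-and-retry wrapper ends up taking - a minimal sketch in Node (the stack I eventually landed on), using the built-in fetch and AbortSignal.timeout (Node 18+). Treat the numbers as placeholders:

// Minimal sketch: fetch with a hard timeout and simple retries
async function fetchWithRetry(url, retries = 2, timeoutMs = 5000) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const response = await fetch(url, {
        signal: AbortSignal.timeout(timeoutMs), // abort slow servers
        headers: { 'User-Agent': 'Mozilla/5.0 (compatible; MetadataBot/1.0)' }
      });
      if (response.ok) return await response.text();
    } catch (err) {
      // Timed out or network error - loop around and retry
    }
  }
  throw new Error(`Failed to fetch ${url} after ${retries + 1} attempts`);
}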

After three days of fighting edge cases, I gave up on the DIY approach.

Second Attempt: Existing APIs

Found a few metadata extraction APIs. The cheapest one was $49/month for 10,000 requests. For a side project that might get 100 users? No thanks.

The Solution: Build My Own

I realized this problem isn't going away. Every developer building bookmarks, chat apps, or content aggregators hits this wall.

So I built Scrapix - a production-ready metadata extraction API that handles all the annoying edge cases.

How It Works

Simple API Call

const response = await fetch('https://scrapix-api.com/extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-RapidAPI-Key': 'your-api-key'
  },
  body: JSON.stringify({ url: 'https://example.com' })
});

const metadata = await response.json();

Response Format

{
  "url": "https://dev.to",
  "title": "DEV Community",
  "description": "Where programmers share ideas and help each other grow",
  "image": "https://dev.to/social-preview.png",
  "favicon": "https://dev.to/favicon.ico",
  "site_name": "DEV Community"
}

Clean, consistent, predictable.

Real-World React Example

Here's how I use it in my bookmark manager:

import { useState } from 'react';

function BookmarkForm() {
  const [url, setUrl] = useState('');
  const [metadata, setMetadata] = useState(null);
  const [loading, setLoading] = useState(false);

  const fetchMetadata = async () => {
    setLoading(true);

    try {
      // Note: calling the API straight from the browser exposes your key.
      // In production, proxy this request through your own backend.
      const response = await fetch('https://scrapix-api.com/extract', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'X-RapidAPI-Key': process.env.RAPIDAPI_KEY
        },
        body: JSON.stringify({ url })
      });

      if (!response.ok) {
        throw new Error(`Request failed: ${response.status}`);
      }

      setMetadata(await response.json());
    } catch (err) {
      console.error(err);
      setMetadata(null);
    } finally {
      setLoading(false); // reset even if the request fails
    }
  };

  return (
    <div>
      <input
        value={url}
        onChange={(e) => setUrl(e.target.value)}
        placeholder="Paste URL..."
      />
      <button onClick={fetchMetadata} disabled={loading}>
        {loading ? 'Loading...' : 'Fetch Preview'}
      </button>

      {metadata && (
        <div className="preview-card">
          <img src={metadata.image} alt={metadata.title} />
          <h3>{metadata.title}</h3>
          <p>{metadata.description}</p>
        </div>
      )}
    </div>
  );
}

Under the Hood: How I Solved the Hard Problems

1. JavaScript-Rendered Content

Used headless browsers (Puppeteer) for sites that require client-side rendering. The API detects whether a page needs JS execution and routes accordingly.
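
The render path itself is conceptually simple, even if the detection heuristics aren't. A stripped-down sketch of the Puppeteer side:

const puppeteer = require('puppeteer');

// Sketch: render the page, then read meta tags from the live DOM
async function extractWithBrowser(url) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 15000 });

    // Runs inside the page, after client-side JS has injected the tags
    return await page.evaluate(() => ({
      title: document.querySelector('meta[property="og:title"]')?.content
        || document.title,
      description: document.querySelector('meta[property="og:description"]')?.content,
      image: document.querySelector('meta[property="og:image"]')?.content
    }));
  } finally {
    await browser.close();
  }
}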

2. Smart Caching

5-minute cache on all metadata requests. Most link previews don't change that often, so this cuts redundant requests by ~70%.
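
The idea is just a TTL keyed on the URL. A naive in-memory sketch (a real deployment would want a shared store like Redis):

// Sketch: naive in-memory TTL cache keyed by URL
const cache = new Map();
const TTL_MS = 5 * 60 * 1000; // 5 minutes

async function getCachedMetadata(url, extract) {
  const entry = cache.get(url);
  if (entry && Date.now() - entry.storedAt < TTL_MS) {
    return entry.data; // cache hit: skip the network entirely
  }
  const data = await extract(url); // cache miss: do the real extraction
  cache.set(url, { data, storedAt: Date.now() });
  return data;
}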

3. Multiple Fallback Strategies

The parser tries in order:

  • OpenGraph tags (og:*)
  • Twitter Cards (twitter:*)
  • Standard meta tags
  • HTML tags (<title>, <h1>)
  • URL parsing (domain name as fallback)
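
In code, the chain for a single field looks something like this (a simplified sketch using cheerio for illustration; the production parser is more defensive):

const cheerio = require('cheerio');

// Try each source in priority order; first non-empty value wins
// (cheerio used here for illustration only)
function extractTitle(html, url) {
  const $ = cheerio.load(html);
  return (
    $('meta[property="og:title"]').attr('content') ||
    $('meta[name="twitter:title"]').attr('content') ||
    $('meta[name="title"]').attr('content') ||
    $('title').text().trim() ||
    $('h1').first().text().trim() ||
    new URL(url).hostname // last resort: the domain itself
  );
}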

4. Global Performance

Deployed on Vercel's edge network. Average response time: 800ms. Cached responses: <100ms.

5. Batch Processing

Need metadata for 10 URLs at once? One API call handles it.
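
The request looks like the single-URL call, just with an array. The endpoint path and payload shape below are illustrative - check the RapidAPI docs for the exact contract:

// NOTE: endpoint path and payload shape are illustrative
const response = await fetch('https://scrapix-api.com/extract/batch', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-RapidAPI-Key': 'your-api-key'
  },
  body: JSON.stringify({
    urls: ['https://dev.to', 'https://example.com', 'https://news.ycombinator.com']
  })
});

const results = await response.json(); // one metadata object per URL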

When Should You Use This vs DIY?

Build it yourself if:

  • You have one specific site to scrape
  • You need very custom data extraction
  • You're building a scraping tool as the core product

Use an API if:

  • You need it to work across any website
  • You don't want to maintain parsing logic
  • Your time is better spent on your actual product
  • You need reliable uptime and performance

Common Use Cases

I've seen developers use Scrapix for:

  • Bookmark managers - Auto-fill card previews
  • Chat apps - Rich link previews (like Slack/Discord)
  • Content aggregators - Pull article metadata
  • Social schedulers - Preview posts before publishing
  • CMS platforms - Auto-populate article details
  • RSS readers - Enhance feed items with images

Performance Matters

For my bookmark app, metadata extraction went from:

  • DIY approach: 3-15 seconds (with failures)
  • With Scrapix: 800ms average, 100ms cached

That's the difference between users waiting and users not noticing.

Pricing That Makes Sense

  • Free tier: 100 requests/month (perfect for testing)
  • Basic: 1,000 requests/month
  • Pro: 10,000 requests/month

(Check RapidAPI for current pricing)

Try It Yourself

The API is live on RapidAPI: Scrapix API

Free tier means you can test it with zero risk. If it doesn't solve your problem, you've lost nothing.

What I Learned Building This

  1. Edge cases are infinite - Every site has its own quirks
  2. Performance beats features - Developers want fast, reliable, simple
  3. Good APIs solve one problem perfectly - Don't try to do everything
  4. Caching is magic - 5-minute TTL made costs drop dramatically

The Bottom Line

If you're building anything that needs link previews, don't waste three days like I did. Your time is worth more than that.

Focus on what makes your product unique. Let specialized tools handle the boring infrastructure stuff.


Have you built something similar? What challenges did you face? Drop a comment below - always curious to hear other developers' experiences with web scraping and metadata extraction.


P.S. - If you found this helpful, I'd love to hear what you're building! Connect with me on GitHub: https://github.com/fiston-user
