Hey DEV community! 👋
Six months ago, I was building a RAG pipeline for a side project and hit a wall that every AI developer knows too well: getting clean content from the web is painful.
YouTube transcripts? You need the YouTube Data API v3 (with quotas). Web pages? Beautiful Soup works, but half the web is JavaScript-rendered now. Twitter threads? Don't even get me started on the API pricing.
So I built ContentAPI — a single API endpoint that extracts clean, structured content from any URL. YouTube videos, web pages, Twitter/X threads, Reddit posts — you throw a URL at it, and it returns clean markdown or text.
Here's what I learned building it, and how you can use it for free.
The Problem: Every Content Source Is Different
When you're building AI-powered apps — chatbots, RAG systems, content analyzers — you spend 80% of your time on data ingestion. Each source has its own quirks:
- YouTube: Rate limits, API keys, no transcripts for some videos
- Web pages: JavaScript rendering, paywalls, cookie banners, cluttered HTML
- Twitter/X: Expensive API, thread reconstruction, media handling
- Reddit: Rate limiting, nested comment trees, deleted content
I wanted one endpoint to rule them all.
The Solution: One URL In, Clean Content Out
Here's the simplest example — extracting a web page:
```python
import requests

response = requests.get(
    "https://api.getcontentapi.com/api/v1/web",
    params={"url": "https://example.com/article"},
    headers={"X-API-Key": "your_api_key"}
)

data = response.json()
print(data["data"]["title"])
print(data["data"]["content"])  # Clean markdown
print(data["data"]["word_count"])
```
That's it. No parsing HTML, no handling JavaScript, no dealing with cookie banners.
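In real code you'll also want to handle failures. The error shape isn't shown above, so treat this as a sketch: I'm assuming failures carry a top-level `"error"` field, while successes wrap everything in `"data"` as in the example.

```python
def unwrap(payload):
    """Return the "data" object from a ContentAPI-style response dict.
    The "error" field checked here is an assumed failure shape."""
    if "error" in payload:
        raise RuntimeError(f"extraction failed: {payload['error']}")
    if "data" not in payload:
        raise RuntimeError("unexpected response shape")
    return payload["data"]

# e.g. article = unwrap(response.json())
article = unwrap({"data": {"title": "Example", "word_count": 1200}})
```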
YouTube Transcripts Without the YouTube API
This is probably the feature I'm most proud of. Getting YouTube transcripts is notoriously painful:
```python
# Extract transcript from any YouTube video
response = requests.get(
    "https://api.getcontentapi.com/api/v1/youtube/transcript",
    params={"url": "https://youtube.com/watch?v=dQw4w9WgXcQ"},
    headers={"X-API-Key": "your_api_key"}
)

data = response.json()["data"]
print(f"Title: {data['title']}")
print(f"Channel: {data['channel']}")
print(f"Duration: {data['duration']}s")
print(f"Transcript: {data['full_text'][:200]}...")
```
What makes it special:
- No YouTube API key needed — ContentAPI handles it
- AI fallback — if no captions exist, we use Whisper to transcribe the audio
- Multi-language support — request transcripts in specific languages
- Timestamped segments — get `start`, `duration`, and `text` for each segment
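Those per-segment fields make it easy to build timestamped notes. A small sketch (the segment shape follows the fields listed above; the sample data itself is made up):

```python
def format_segments(segments):
    """Render transcript segments as [mm:ss] timestamped lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text']}")
    return "\n".join(lines)

# Hypothetical sample data in the documented segment shape
sample = [
    {"start": 0.0, "duration": 4.2, "text": "Never gonna give you up"},
    {"start": 4.2, "duration": 3.8, "text": "Never gonna let you down"},
]
print(format_segments(sample))
```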
The Python SDK
I also built a Python SDK to make things even simpler:
```bash
pip install contentapi
```

```python
from contentapi import ContentAPI

client = ContentAPI(api_key="your_api_key")

# Extract from any URL — auto-detects the source type
result = client.extract("https://youtube.com/watch?v=...")
print(result.title)
print(result.content)

# Batch processing — up to 10 URLs at once
results = client.batch([
    "https://youtube.com/watch?v=abc",
    "https://example.com/blog-post",
    "https://reddit.com/r/python/comments/xyz"
])
```
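I won't show the SDK's internals here, but the core of auto-detection is simple: route on the URL's host. A simplified, hypothetical sketch of that idea:

```python
from urllib.parse import urlparse

def detect_source_type(url):
    """Guess which extractor to use from the URL's hostname.
    A simplified stand-in for the SDK's real routing logic."""
    host = urlparse(url).hostname or ""
    if "youtube.com" in host or "youtu.be" in host:
        return "youtube"
    if "reddit.com" in host:
        return "reddit"
    if "twitter.com" in host or host == "x.com":
        return "twitter"
    return "web"
```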
Structured Data Extraction
Need specific fields from a page? Use the `/extract` endpoint with a schema:
```python
response = requests.post(
    "https://api.getcontentapi.com/api/v1/extract",
    headers={"X-API-Key": "your_api_key"},
    json={
        "url": "https://example.com/product-page",
        "schema": {
            "title": "string",
            "price": "number",
            "features": ["string"],
            "in_stock": "boolean"
        }
    }
)

# Returns structured JSON matching your schema
print(response.json()["data"]["extracted"])
# {"title": "...", "price": 29.99, "features": [...], "in_stock": true}
```
This is AI-powered extraction — it understands the page content and maps it to your schema.
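Since LLM output can drift, a cheap client-side sanity check is worth layering on top. This sketch covers only the type names used in the example schema above (`string`, `number`, `boolean`, and list-of-type):

```python
# Maps the schema type names from the example above to Python types
TYPE_CHECKS = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
}

def matches_schema(extracted, schema):
    """Loosely verify that extracted fields match the requested schema."""
    for field, expected in schema.items():
        value = extracted.get(field)
        if isinstance(expected, list):  # e.g. ["string"] means list of strings
            if not isinstance(value, list):
                return False
            if not all(isinstance(v, TYPE_CHECKS[expected[0]]) for v in value):
                return False
        elif not isinstance(value, TYPE_CHECKS[expected]):
            return False
    return True
```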
Website Crawling
Need to ingest an entire documentation site? The crawl endpoint handles it:
```python
# Start a crawl
response = requests.post(
    "https://api.getcontentapi.com/api/v1/crawl",
    headers={"X-API-Key": "your_api_key"},
    json={
        "url": "https://docs.example.com",
        "max_pages": 50,
        "max_depth": 3,
        "include_patterns": ["/docs/*"],
        "format": "markdown"
    }
)
crawl_id = response.json()["data"]["crawl_id"]

# Poll for results (or use webhook_url for push notifications)
import time

while True:
    status = requests.get(
        f"https://api.getcontentapi.com/api/v1/crawl/{crawl_id}",
        headers={"X-API-Key": "your_api_key"}
    ).json()["data"]
    if status["status"] in ["completed", "partial", "failed"]:
        break
    time.sleep(5)

# Process results
for page in status["results"]:
    print(f"{page['title']}: {page['word_count']} words")
```
What I Learned Building This
1. Web scraping is an arms race
Every month, sites change their anti-bot measures. Building a reliable extraction service means constantly adapting. We use a mix of headless browsers, custom parsers, and fallback strategies.
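That fallback idea generalizes nicely: try the cheapest extractor first and escalate only when it fails. A toy sketch of the pattern (the strategy functions here are hypothetical stand-ins, not our real extractors):

```python
def extract_with_fallbacks(url, strategies):
    """Try each extraction strategy in order; return the first success."""
    errors = []
    for strategy in strategies:
        try:
            return strategy(url)
        except Exception as exc:
            errors.append(f"{strategy.__name__}: {exc}")
    raise RuntimeError(f"all strategies failed for {url}: {errors}")

# Hypothetical strategies, ordered cheapest-first
def static_fetch(url):
    raise ValueError("page is JavaScript-rendered")

def headless_browser(url):
    return "<clean markdown>"

content = extract_with_fallbacks("https://example.com", [static_fetch, headless_browser])
```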
2. The free tier matters
I wanted ContentAPI to be genuinely useful for hobbyists and indie devs. The free tier gives you 5,000 requests/month — enough to build and prototype real applications. No credit card required.
3. AI makes extraction dramatically better
Traditional scraping is brittle — one HTML change breaks everything. Using LLMs for structured extraction means the API understands content, not just DOM structure.
4. Developers want simplicity
The most popular feature request was "just give me one endpoint for everything." That's why the batch endpoint exists — throw any mix of URLs at it and get clean content back.
The MCP Server
For AI developers using Claude, Cursor, or Windsurf — there's also an MCP (Model Context Protocol) server:
```bash
npx contentapi-mcp-server
```
This lets AI assistants directly extract web content during conversations. Your AI can read any URL, get YouTube transcripts, and crawl websites — all through natural language.
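For Claude Desktop, wiring it up looks something like this. The `mcpServers` entry is Claude's standard MCP config format, but the `CONTENTAPI_API_KEY` variable name is my assumption here, so check the server's README for the exact name:

```json
{
  "mcpServers": {
    "contentapi": {
      "command": "npx",
      "args": ["contentapi-mcp-server"],
      "env": { "CONTENTAPI_API_KEY": "your_api_key" }
    }
  }
}
```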
Try It Free
- Website: getcontentapi.com
- Docs: getcontentapi.com/docs
- Python SDK: `pip install contentapi`
- MCP Server: `npx contentapi-mcp-server`
- Free tier: 5,000 requests/month, no credit card
I'd love to hear your feedback. What content sources would you want supported next? Drop a comment below! 👇