Hey DEV community! 👋
Six months ago, I was building a RAG pipeline for a side project and hit a wall that every AI developer knows too well: getting clean content from the web is painful.
YouTube transcripts? You need the YouTube Data API v3 (with quotas). Web pages? Beautiful Soup works, but half the web is JavaScript-rendered now. Twitter threads? Don't even get me started on the API pricing.
So I built ContentAPI — a single API endpoint that extracts clean, structured content from any URL. YouTube videos, web pages, Twitter/X threads, Reddit posts — you throw a URL at it, and it returns clean markdown or text.
Here's what I learned building it, and how you can use it for free.
The Problem: Every Content Source Is Different
When you're building AI-powered apps — chatbots, RAG systems, content analyzers — you spend 80% of your time on data ingestion. Each source has its own quirks:
- YouTube: Rate limits, API keys, no transcripts for some videos
- Web pages: JavaScript rendering, paywalls, cookie banners, cluttered HTML
- Twitter/X: Expensive API, thread reconstruction, media handling
- Reddit: Rate limiting, nested comment trees, deleted content
I wanted one endpoint to rule them all.
The Solution: One URL In, Clean Content Out
Here's the simplest example — extracting a web page:
```python
import requests

response = requests.get(
    "https://api.getcontentapi.com/api/v1/web",
    params={"url": "https://example.com/article"},
    headers={"X-API-Key": "your_api_key"}
)

data = response.json()
print(data["data"]["title"])
print(data["data"]["content"])  # Clean markdown
print(data["data"]["word_count"])
```
That's it. No parsing HTML, no handling JavaScript, no dealing with cookie banners.
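In real code you'll also want to handle failures. The error shape isn't shown above, so treat this as a sketch: I'm assuming failures carry a top-level `"error"` field, while successes wrap everything in `"data"` as in the example.

```python
def unwrap(payload):
    """Return the "data" object from a ContentAPI-style response dict.
    The "error" field checked here is an assumed failure shape."""
    if "error" in payload:
        raise RuntimeError(f"extraction failed: {payload['error']}")
    if "data" not in payload:
        raise RuntimeError("unexpected response shape")
    return payload["data"]

# e.g. article = unwrap(response.json())
article = unwrap({"data": {"title": "Example", "word_count": 1200}})
```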
YouTube Transcripts Without the YouTube API
This is probably the feature I'm most proud of. Getting YouTube transcripts is notoriously painful:
```python
# Extract transcript from any YouTube video
response = requests.get(
    "https://api.getcontentapi.com/api/v1/youtube/transcript",
    params={"url": "https://youtube.com/watch?v=dQw4w9WgXcQ"},
    headers={"X-API-Key": "your_api_key"}
)

data = response.json()["data"]
print(f"Title: {data['title']}")
print(f"Channel: {data['channel']}")
print(f"Duration: {data['duration']}s")
print(f"Transcript: {data['full_text'][:200]}...")
```
What makes it special:
- No YouTube API key needed — ContentAPI handles it
- AI fallback — if no captions exist, we use Whisper to transcribe the audio
- Multi-language support — request transcripts in specific languages
- Timestamped segments — get `start`, `duration`, and `text` for each segment
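Those per-segment fields make it easy to build timestamped notes. A small sketch (the segment shape follows the fields listed above; the sample data itself is made up):

```python
def format_segments(segments):
    """Render transcript segments as [mm:ss] timestamped lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text']}")
    return "\n".join(lines)

# Hypothetical sample data in the documented segment shape
sample = [
    {"start": 0.0, "duration": 4.2, "text": "Never gonna give you up"},
    {"start": 4.2, "duration": 3.8, "text": "Never gonna let you down"},
]
print(format_segments(sample))
```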
The Python SDK
I also built a Python SDK to make things even simpler:
```bash
pip install contentapi
```

```python
from contentapi import ContentAPI

client = ContentAPI(api_key="your_api_key")

# Extract from any URL — auto-detects the source type
result = client.extract("https://youtube.com/watch?v=...")
print(result.title)
print(result.content)

# Batch processing — up to 10 URLs at once
results = client.batch([
    "https://youtube.com/watch?v=abc",
    "https://example.com/blog-post",
    "https://reddit.com/r/python/comments/xyz"
])
```
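I won't show the SDK's internals here, but the core of auto-detection is simple: route on the URL's host. A simplified, hypothetical sketch of that idea:

```python
from urllib.parse import urlparse

def detect_source_type(url):
    """Guess which extractor to use from the URL's hostname.
    A simplified stand-in for the SDK's real routing logic."""
    host = urlparse(url).hostname or ""
    if "youtube.com" in host or "youtu.be" in host:
        return "youtube"
    if "reddit.com" in host:
        return "reddit"
    if "twitter.com" in host or host == "x.com":
        return "twitter"
    return "web"
```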
Structured Data Extraction
Need specific fields from a page? Use the `/extract` endpoint with a schema:
```python
response = requests.post(
    "https://api.getcontentapi.com/api/v1/extract",
    headers={"X-API-Key": "your_api_key"},
    json={
        "url": "https://example.com/product-page",
        "schema": {
            "title": "string",
            "price": "number",
            "features": ["string"],
            "in_stock": "boolean"
        }
    }
)

# Returns structured JSON matching your schema
print(response.json()["data"]["extracted"])
# {"title": "...", "price": 29.99, "features": [...], "in_stock": true}
```
This is AI-powered extraction — it understands the page content and maps it to your schema.
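Since LLM output can drift, a cheap client-side sanity check is worth layering on top. This sketch covers only the type names used in the example schema above (`string`, `number`, `boolean`, and list-of-type):

```python
# Maps the schema type names from the example above to Python types
TYPE_CHECKS = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
}

def matches_schema(extracted, schema):
    """Loosely verify that extracted fields match the requested schema."""
    for field, expected in schema.items():
        value = extracted.get(field)
        if isinstance(expected, list):  # e.g. ["string"] means list of strings
            if not isinstance(value, list):
                return False
            if not all(isinstance(v, TYPE_CHECKS[expected[0]]) for v in value):
                return False
        elif not isinstance(value, TYPE_CHECKS[expected]):
            return False
    return True
```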
Website Crawling
Need to ingest an entire documentation site? The crawl endpoint handles it:
```python
# Start a crawl
response = requests.post(
    "https://api.getcontentapi.com/api/v1/crawl",
    headers={"X-API-Key": "your_api_key"},
    json={
        "url": "https://docs.example.com",
        "max_pages": 50,
        "max_depth": 3,
        "include_patterns": ["/docs/*"],
        "format": "markdown"
    }
)
crawl_id = response.json()["data"]["crawl_id"]

# Poll for results (or use webhook_url for push notifications)
import time

while True:
    status = requests.get(
        f"https://api.getcontentapi.com/api/v1/crawl/{crawl_id}",
        headers={"X-API-Key": "your_api_key"}
    ).json()["data"]
    if status["status"] in ["completed", "partial", "failed"]:
        break
    time.sleep(5)

# Process results
for page in status["results"]:
    print(f"{page['title']}: {page['word_count']} words")
```
What I Learned Building This
1. Web scraping is an arms race
Every month, sites change their anti-bot measures. Building a reliable extraction service means constantly adapting. We use a mix of headless browsers, custom parsers, and fallback strategies.
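That fallback idea generalizes nicely: try the cheapest extractor first and escalate only when it fails. A toy sketch of the pattern (the strategy functions here are hypothetical stand-ins, not our real extractors):

```python
def extract_with_fallbacks(url, strategies):
    """Try each extraction strategy in order; return the first success."""
    errors = []
    for strategy in strategies:
        try:
            return strategy(url)
        except Exception as exc:
            errors.append(f"{strategy.__name__}: {exc}")
    raise RuntimeError(f"all strategies failed for {url}: {errors}")

# Hypothetical strategies, ordered cheapest-first
def static_fetch(url):
    raise ValueError("page is JavaScript-rendered")

def headless_browser(url):
    return "<clean markdown>"

content = extract_with_fallbacks("https://example.com", [static_fetch, headless_browser])
```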
2. The free tier matters
I wanted ContentAPI to be genuinely useful for hobbyists and indie devs. The free tier gives you 5,000 requests/month — enough to build and prototype real applications. No credit card required.
3. AI makes extraction dramatically better
Traditional scraping is brittle — one HTML change breaks everything. Using LLMs for structured extraction means the API understands content, not just DOM structure.
4. Developers want simplicity
The most popular feature request was "just give me one endpoint for everything." That's why the batch endpoint exists — throw any mix of URLs at it and get clean content back.
The MCP Server
For AI developers using Claude, Cursor, or Windsurf — there's also an MCP (Model Context Protocol) server:
```bash
npx contentapi-mcp-server
```
This lets AI assistants directly extract web content during conversations. Your AI can read any URL, get YouTube transcripts, and crawl websites — all through natural language.
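For Claude Desktop, wiring it up looks something like this. The `mcpServers` entry is Claude's standard MCP config format, but the `CONTENTAPI_API_KEY` variable name is my assumption here, so check the server's README for the exact name:

```json
{
  "mcpServers": {
    "contentapi": {
      "command": "npx",
      "args": ["contentapi-mcp-server"],
      "env": { "CONTENTAPI_API_KEY": "your_api_key" }
    }
  }
}
```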
Try It Free
- Website: getcontentapi.com
- Docs: getcontentapi.com/docs
- Python SDK: `pip install contentapi`
- MCP Server: `npx contentapi-mcp-server`
- Free tier: 5,000 requests/month, no credit card
I'd love to hear your feedback. What content sources would you want supported next? Drop a comment below! 👇