The Problem Every AI Developer Hits
You're building a RAG pipeline, an AI agent, or a chatbot that needs to read web pages. You write a quick requests.get(), parse the HTML, and... get a mess of navigation bars, cookie banners, ads, and broken formatting.
Sound familiar?
I hit this wall while building AgentIndex, an open registry for AI agents. My crawlers needed to extract clean, structured content from 25,000+ web pages — not raw HTML soup.
Here's what I tried and why each failed:
❌ httpx + BeautifulSoup
- Result: 15% success rate on real-world URLs
- Why: JavaScript-heavy sites return empty `<body>` tags. SPAs render nothing server-side.
❌ Basic Playwright
- Result: 40% success rate
- Why: Pages load async content after `DOMContentLoaded`. Without waiting for network idle + scroll simulation, you miss half the content.
❌ Existing solutions
- Firecrawl: Great product, but not on RapidAPI. Credit multipliers make real cost 5-9x the advertised price.
- Jina Reader: Free with rate limits, no AI summary, no quality score.
- html-to-markdown converters on RapidAPI: They convert HTML you already have. They don't fetch anything.
There was literally no URL → clean Markdown API on RapidAPI's marketplace of 4M+ developers.
So I built one.
Introducing WebPulse
WebPulse converts any URL into clean, LLM-ready Markdown in one API call. It returns:
- ✅ Structured Markdown — cleaned of nav, ads, footers, scripts
- ✅ Metadata — title, author, publish date, language, detected automatically
- ✅ Quality Score (0-1) — so your pipeline knows if the content is usable
- ✅ LLM-Ready Context Block — pre-formatted with SOURCE, TITLE, DATE, LANG, token count
- ✅ Word count & reading time — useful for token budget estimation
What Makes It Different
| Feature | WebPulse | Firecrawl | Jina Reader |
|---|---|---|---|
| On RapidAPI | ✅ | ❌ | ❌ |
| Headless browser | ✅ | ✅ | ❌ |
| Quality score | ✅ | ❌ | ❌ |
| LLM context block | ✅ | ❌ | ❌ |
| Free tier | ✅ 50 req/mo | ❌ | ✅ (rate limited) |
| JS-heavy sites | ✅ 95%+ | ✅ | ❌ |
Quick Start (2 Minutes)
1. Subscribe on RapidAPI
Free plan: 50 requests/month. No credit card required.
2. Make Your First Call
Python:

```python
import requests

url = "https://webpulse-url-to-markdown-for-llms.p.rapidapi.com/scrape"
payload = {"url": "https://example.com/blog-post"}
headers = {
    "Content-Type": "application/json",
    "X-RapidAPI-Key": "YOUR_API_KEY",
    "X-RapidAPI-Host": "webpulse-url-to-markdown-for-llms.p.rapidapi.com"
}

response = requests.post(url, json=payload, headers=headers)
data = response.json()

print(data["markdown"])        # Clean markdown content
print(data["quality_score"])   # 0.0 to 1.0
print(data["metadata"]["title"])
```
JavaScript:

```javascript
const response = await fetch(
  "https://webpulse-url-to-markdown-for-llms.p.rapidapi.com/scrape",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-RapidAPI-Key": "YOUR_API_KEY",
      "X-RapidAPI-Host": "webpulse-url-to-markdown-for-llms.p.rapidapi.com"
    },
    body: JSON.stringify({ url: "https://example.com/blog-post" })
  }
);

const data = await response.json();
console.log(data.markdown);
console.log(data.quality_score);
```
cURL:

```shell
curl -X POST \
  "https://webpulse-url-to-markdown-for-llms.p.rapidapi.com/scrape" \
  -H "Content-Type: application/json" \
  -H "X-RapidAPI-Key: YOUR_API_KEY" \
  -H "X-RapidAPI-Host: webpulse-url-to-markdown-for-llms.p.rapidapi.com" \
  -d '{"url": "https://en.wikipedia.org/wiki/Artificial_intelligence"}'
```
Real-World Response Example
Here's what you get when scraping a Wikipedia article:
```json
{
  "success": true,
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "method_used": "playwright_readability",
  "processing_time_ms": 9500,
  "quality_score": 0.9,
  "markdown": "# Artificial intelligence\n\nArtificial intelligence (AI) is the capability of computational systems...",
  "metadata": {
    "title": "Artificial intelligence - Wikipedia",
    "author": null,
    "published_date": null,
    "language": "en",
    "word_count": 29631,
    "reading_time_min": 148
  },
  "llm_ready": {
    "context_block": "SOURCE: en.wikipedia.org | TITLE: Artificial intelligence - Wikipedia | LANG: en\n\nArtificial intelligence (AI) is...",
    "token_count": 3980
  }
}
```
The quality_score of 0.9 tells your pipeline: "this content is clean and usable." Anything below 0.4? Your code can skip it automatically.
3 Endpoints, 3 Use Cases
POST /scrape — The Main Endpoint
Give it a URL, get back clean Markdown. Uses Playwright headless browser with smart fallbacks:
- Tries the `readability` algorithm first (fast)
- Falls back to `js_extract` for stubborn sites
- Scrolls the page to trigger lazy-loaded content
- Waits for network idle before extraction
POST /convert — HTML You Already Have
Already fetched the HTML yourself? Send it directly. No browser needed, instant conversion. Free on all plans.
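A minimal sketch of calling `/convert` with HTML you already hold. Note: the `html` payload field name is an assumption on my part; check the API's RapidAPI docs for the exact request schema.

```python
def build_convert_request(html: str, api_key: str):
    """Assemble the URL, payload, and headers for POST /convert.
    The "html" field name is assumed, not confirmed by the docs."""
    endpoint = "https://webpulse-url-to-markdown-for-llms.p.rapidapi.com/convert"
    payload = {"html": html}
    headers = {
        "Content-Type": "application/json",
        "X-RapidAPI-Key": api_key,
        "X-RapidAPI-Host": "webpulse-url-to-markdown-for-llms.p.rapidapi.com",
    }
    return endpoint, payload, headers

# Then send it with: requests.post(endpoint, json=payload, headers=headers)
endpoint, payload, headers = build_convert_request("<h1>Hello</h1>", "YOUR_API_KEY")
```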
GET /health — System Status
Check if the browser pool is warm, cache is connected, and the system is operational.
Use Cases for Developers
🔗 RAG Pipelines
Feed web content into your vector database. The quality score filters out low-value pages automatically.
```python
data = webpulse_scrape(url)  # your wrapper around POST /scrape
if data["quality_score"] >= 0.6:
    chunks = split_into_chunks(data["markdown"])
    vector_db.upsert(chunks, metadata=data["metadata"])
```
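The `split_into_chunks` helper above is left to you; a minimal word-window chunker with overlap (a common default for embedding pipelines, not part of WebPulse itself) might look like:

```python
def split_into_chunks(markdown: str, max_words: int = 200, overlap: int = 20):
    """Split markdown into overlapping word-window chunks for embedding."""
    words = markdown.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
        if start + max_words >= len(words):
            break  # last window already covers the tail
    return chunks
```

In practice you'd chunk on heading or paragraph boundaries for better retrieval, but a word window is enough to get started.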
🤖 AI Agents
Let your agent browse the web and read pages. The LLM-ready context block is pre-formatted for injection into prompts.
```python
context = data["llm_ready"]["context_block"]
prompt = f"Based on this source:\n{context}\n\nAnswer: {user_question}"
```
📊 Content Monitoring
Track changes on competitor pages, news sites, or documentation. Compare Markdown diffs over time.
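Since the output is plain Markdown, change tracking reduces to a text diff. A small sketch using Python's stdlib `difflib` on two scraped snapshots:

```python
import difflib

def markdown_diff(old_md: str, new_md: str):
    """Return unified-diff lines between two markdown snapshots."""
    return list(difflib.unified_diff(
        old_md.splitlines(), new_md.splitlines(),
        fromfile="yesterday", tofile="today", lineterm="",
    ))

# Example: detect a pricing change between two scrapes
changes = markdown_diff("# Pricing\n$5/mo", "# Pricing\n$9/mo")
```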
🔍 Search Engine for LLMs
Build a search tool that fetches and summarizes web results in real-time.
How It Works Under the Hood
WebPulse runs a Playwright browser pool on a dedicated server. Here's the pipeline:
```
URL → Validate → Browser Pool → Page Load (networkidle)
  → Scroll Simulation → Content Extraction (readability + JS fallback)
  → Noise Removal → Markdown Conversion → Metadata Extraction
  → Quality Scoring → LLM Context Block → Cache → Response
```
Key technical decisions:
- Browser pool: 3 persistent Chromium instances, recycled every 50 pages to prevent memory leaks
- Smart waiting: `networkidle` + a 2s scroll delay catches 95%+ of JS-rendered content
- Dual extraction: the `readability` algorithm runs first; a `document.body.innerText` fallback recovered 8 additional sites out of 20 in testing
- Redis caching: repeated URLs return instantly from cache
- Quality scoring: based on content length, text-to-HTML ratio, heading structure, and link density
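WebPulse's exact scoring formula isn't public; as a rough illustration of how those four signals could combine into a 0-1 score, here is a hedged approximation (my own weights and thresholds, not the service's):

```python
import re

def approx_quality_score(markdown: str, raw_html_len: int) -> float:
    """Illustrative quality heuristic -- NOT WebPulse's actual formula.
    Combines content length, text-to-HTML ratio, headings, and link density."""
    text_len = len(markdown)
    if text_len == 0 or raw_html_len == 0:
        return 0.0
    length_signal = min(text_len / 2000, 1.0)           # is there enough content?
    ratio_signal = min(text_len / raw_html_len, 1.0)    # text vs markup bloat
    headings = len(re.findall(r"^#{1,6} ", markdown, flags=re.M))
    heading_signal = min(headings / 5, 1.0)             # structured document?
    links = markdown.count("](")
    link_density = links / max(text_len / 100, 1)       # links per ~100 chars
    link_signal = 1.0 - min(link_density, 1.0)          # link farms score low
    score = (0.35 * length_signal + 0.25 * ratio_signal
             + 0.2 * heading_signal + 0.2 * link_signal)
    return round(max(0.0, min(score, 1.0)), 2)
```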
Pricing
| Plan | Price | Requests/mo | Rate Limit | Best For |
|---|---|---|---|---|
| Basic | Free | 50 | 5/min | Testing & evaluation |
| Pro ⭐ | $5/mo | 1,000 | 30/min | Individual developers |
| Ultra | $19/mo | 5,000 | 60/min | Startups & small teams |
| Mega | $49/mo | 20,000 | 100/min | Production workloads |
FAQ
Q: What sites does it support?
95%+ of public websites. JavaScript SPAs, news sites, blogs, documentation, wikis — all work. Login-walled and paywalled sites return limited content (as expected).
Q: How fast is it?
Cached responses: instant. Fresh scrapes: 4-10 seconds depending on page complexity. The /convert endpoint (raw HTML input) is sub-second.
Q: Do you respect robots.txt?
Yes. WebPulse is designed for legitimate use cases like research, RAG, and content analysis.
Q: Can I use it with LangChain / LlamaIndex?
Absolutely. Just call the API in a custom loader and return the markdown + metadata.
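As a sketch of that loader pattern: shape the `/scrape` response into LangChain-style documents. I use plain dicts here to stay dependency-free; in real LangChain code you'd build `Document` objects instead. The 0.4 threshold mirrors the quality cutoff mentioned earlier.

```python
def to_documents(resp: dict):
    """Shape a WebPulse /scrape response into LangChain-style documents
    (plain dicts here; with LangChain installed, wrap these in Document)."""
    if not resp.get("success") or resp.get("quality_score", 0) < 0.4:
        return []  # skip failed or low-quality scrapes
    meta = dict(resp.get("metadata", {}))
    meta["source"] = resp.get("url")
    return [{"page_content": resp["markdown"], "metadata": meta}]

docs = to_documents({
    "success": True,
    "url": "https://example.com/post",
    "quality_score": 0.9,
    "markdown": "# Hello",
    "metadata": {"title": "Hello"},
})
```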
What's Next
- AI Summary endpoint — get a 3-sentence summary powered by LLMs (coming soon)
- Batch scraping — submit multiple URLs in one call
- Screenshot capture — get a visual snapshot alongside the markdown
- Webhook support — async scraping with callback
Try It Now
The free tier gives you 50 requests/month — enough to test it in your pipeline and see if it fits.
👉 WebPulse on RapidAPI — Start Free
Built by Comall Agency. Questions? Drop a comment below or reach out on the RapidAPI community page.
If this helped you, consider leaving a ⭐ reaction — it helps other developers find this article!