Comall Agency
I Built a URL-to-Markdown API for LLMs — Here's Why Existing Tools Fell Short

The Problem Every AI Developer Hits

You're building a RAG pipeline, an AI agent, or a chatbot that needs to read web pages. You write a quick requests.get(), parse the HTML, and... get a mess of navigation bars, cookie banners, ads, and broken formatting.

Sound familiar?

I hit this wall while building AgentIndex, an open registry for AI agents. My crawlers needed to extract clean, structured content from 25,000+ web pages — not raw HTML soup.

Here's what I tried and why each failed:

❌ httpx + BeautifulSoup

  • Result: 15% success rate on real-world URLs
  • Why: JavaScript-heavy sites return empty <body> tags. SPAs render nothing server-side.

❌ Basic Playwright

  • Result: 40% success rate
  • Why: Pages load async content after DOMContentLoaded. Without waiting for network idle + scroll simulation, you miss half the content.

❌ Existing solutions

  • Firecrawl: Great product, but not on RapidAPI. Credit multipliers make real cost 5-9x the advertised price.
  • Jina Reader: Free with rate limits, no AI summary, no quality score.
  • html-to-markdown converters on RapidAPI: They convert HTML you already have. They don't fetch anything.

There was literally no URL → clean Markdown API on RapidAPI's marketplace of 4M+ developers.

So I built one.


Introducing WebPulse

WebPulse converts any URL into clean, LLM-ready Markdown in one API call. It returns:

  • Structured Markdown — cleaned of nav, ads, footers, scripts
  • Metadata — title, author, publish date, language, detected automatically
  • Quality Score (0-1) — so your pipeline knows if the content is usable
  • LLM-Ready Context Block — pre-formatted with SOURCE, TITLE, DATE, LANG, token count
  • Word count & reading time — useful for token budget estimation

What Makes It Different

| Feature | WebPulse | Firecrawl | Jina Reader |
|---|---|---|---|
| On RapidAPI | ✅ | ❌ | ❌ |
| Headless browser | ✅ | ✅ | ✅ |
| Quality score | ✅ | ❌ | ❌ |
| LLM context block | ✅ | ❌ | ❌ |
| Free tier | ✅ 50 req/mo | — | ✅ (rate limited) |
| JS-heavy sites | ✅ 95%+ | ✅ | — |

Quick Start (2 Minutes)

1. Subscribe on RapidAPI

👉 WebPulse on RapidAPI

Free plan: 50 requests/month. No credit card required.

2. Make Your First Call

Python:

import requests

url = "https://webpulse-url-to-markdown-for-llms.p.rapidapi.com/scrape"

payload = {"url": "https://example.com/blog-post"}
headers = {
    "Content-Type": "application/json",
    "X-RapidAPI-Key": "YOUR_API_KEY",
    "X-RapidAPI-Host": "webpulse-url-to-markdown-for-llms.p.rapidapi.com"
}

response = requests.post(url, json=payload, headers=headers)
data = response.json()

print(data["markdown"])        # Clean markdown content
print(data["quality_score"])   # 0.0 to 1.0
print(data["metadata"]["title"])

JavaScript:

const response = await fetch(
  "https://webpulse-url-to-markdown-for-llms.p.rapidapi.com/scrape",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-RapidAPI-Key": "YOUR_API_KEY",
      "X-RapidAPI-Host": "webpulse-url-to-markdown-for-llms.p.rapidapi.com"
    },
    body: JSON.stringify({ url: "https://example.com/blog-post" })
  }
);

const data = await response.json();
console.log(data.markdown);
console.log(data.quality_score);

cURL:

curl -X POST \
  "https://webpulse-url-to-markdown-for-llms.p.rapidapi.com/scrape" \
  -H "Content-Type: application/json" \
  -H "X-RapidAPI-Key: YOUR_API_KEY" \
  -H "X-RapidAPI-Host: webpulse-url-to-markdown-for-llms.p.rapidapi.com" \
  -d '{"url": "https://en.wikipedia.org/wiki/Artificial_intelligence"}'

Real-World Response Example

Here's what you get when scraping a Wikipedia article:

{
  "success": true,
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "method_used": "playwright_readability",
  "processing_time_ms": 9500,
  "quality_score": 0.9,
  "markdown": "# Artificial intelligence\n\nArtificial intelligence (AI) is the capability of computational systems...",
  "metadata": {
    "title": "Artificial intelligence - Wikipedia",
    "author": null,
    "published_date": null,
    "language": "en",
    "word_count": 29631,
    "reading_time_min": 148
  },
  "llm_ready": {
    "context_block": "SOURCE: en.wikipedia.org | TITLE: Artificial intelligence - Wikipedia | LANG: en\n\nArtificial intelligence (AI) is...",
    "token_count": 3980
  }
}

The quality_score of 0.9 tells your pipeline: "this content is clean and usable." Anything below 0.4? Your code can skip it automatically.


3 Endpoints, 3 Use Cases

POST /scrape — The Main Endpoint

Give it a URL, get back clean Markdown. It uses a Playwright headless browser with smart fallbacks:

  1. First tries readability algorithm (fast)
  2. Falls back to js_extract for stubborn sites
  3. Scrolls the page to trigger lazy-loaded content
  4. Waits for network idle before extraction

POST /convert — HTML You Already Have

Already fetched the HTML yourself? Send it directly. No browser needed, instant conversion. Free on all plans.
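If you already have the HTML, a call to /convert might look like the sketch below. The "html" payload field name is my assumption, mirroring the /scrape examples; check the exact schema on the RapidAPI listing before relying on it.

```python
# Hedged sketch of POST /convert for HTML you already have.
# The "html" field name is an assumption; verify it in the API docs.
CONVERT_URL = "https://webpulse-url-to-markdown-for-llms.p.rapidapi.com/convert"

payload = {"html": "<article><h1>Hello</h1><p>Already-fetched content.</p></article>"}
headers = {
    "Content-Type": "application/json",
    "X-RapidAPI-Key": "YOUR_API_KEY",
    "X-RapidAPI-Host": "webpulse-url-to-markdown-for-llms.p.rapidapi.com",
}

# With a real key:
# import requests
# data = requests.post(CONVERT_URL, json=payload, headers=headers).json()
# print(data["markdown"])
```

Since no browser is involved, this is the endpoint to reach for when you already run your own fetching layer.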

GET /health — System Status

Check if the browser pool is warm, cache is connected, and the system is operational.


Use Cases for Developers

🔗 RAG Pipelines

Feed web content into your vector database. The quality score filters out low-value pages automatically.

data = webpulse_scrape(url)
if data["quality_score"] >= 0.6:
    chunks = split_into_chunks(data["markdown"])
    vector_db.upsert(chunks, metadata=data["metadata"])

🤖 AI Agents

Let your agent browse the web and read pages. The LLM-ready context block is pre-formatted for injection into prompts.

context = data["llm_ready"]["context_block"]
prompt = f"Based on this source:\n{context}\n\nAnswer: {user_question}"

📊 Content Monitoring

Track changes on competitor pages, news sites, or documentation. Compare Markdown diffs over time.
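Comparing two scrapes of the same page is a stdlib one-liner once you have Markdown. A minimal sketch (the two snapshot strings are hypothetical stand-ins for stored scrape results):

```python
import difflib

# Two hypothetical markdown snapshots of the same page, scraped a day apart.
yesterday = "# Pricing\n\nPro plan: $5/mo\n"
today = "# Pricing\n\nPro plan: $7/mo\n"

diff = list(difflib.unified_diff(
    yesterday.splitlines(), today.splitlines(),
    fromfile="yesterday.md", tofile="today.md", lineterm="",
))
# Flag a change if any real +/- line appears (ignoring the file headers).
changed = any(
    line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    for line in diff
)
print("\n".join(diff))
```

Markdown diffs are far more readable than HTML diffs, which is what makes this workflow practical.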

🔍 Search Engine for LLMs

Build a search tool that fetches and summarizes web results in real-time.
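One way to wire that up: scrape each result URL and pack the pre-formatted context blocks into a prompt until a token budget is hit. A hypothetical sketch (`scrape` stands in for your wrapper around POST /scrape; the budget logic is mine):

```python
def build_context(urls, scrape, max_tokens=3000):
    """Assemble an LLM context from several scraped pages, stopping at a
    token budget. Uses the context_block and token_count fields from the
    llm_ready block of each response."""
    parts, used = [], 0
    for url in urls:
        data = scrape(url)
        tokens = data["llm_ready"]["token_count"]
        if used + tokens > max_tokens:
            break  # budget exhausted; drop remaining results
        parts.append(data["llm_ready"]["context_block"])
        used += tokens
    return "\n\n---\n\n".join(parts)

# Stub scraper for demonstration (real code would call the API):
fake = lambda url: {"llm_ready": {"context_block": f"SOURCE: {url}", "token_count": 1000}}
context = build_context(
    ["https://a.example", "https://b.example",
     "https://c.example", "https://d.example"],
    fake,
)
```

With a 3,000-token budget and 1,000-token pages, the stub keeps the first three sources and drops the fourth.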


How It Works Under the Hood

WebPulse runs a Playwright browser pool on a dedicated server. Here's the pipeline:

URL → Validate → Browser Pool → Page Load (networkidle)
    → Scroll Simulation → Content Extraction (readability + JS fallback)
    → Noise Removal → Markdown Conversion → Metadata Extraction
    → Quality Scoring → LLM Context Block → Cache → Response

Key technical decisions:

  • Browser pool: 3 persistent Chromium instances, recycled every 50 pages to prevent memory leaks
  • Smart waiting: networkidle + 2s scroll delay catches 95%+ of JS-rendered content
  • Dual extraction: readability algorithm first; the document.body.innerText fallback recovered 8 additional sites out of 20 in testing
  • Redis caching: Repeated URLs return instantly from cache
  • Quality scoring: Based on content length, text-to-HTML ratio, heading structure, and link density
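The exact scoring formula isn't published; as a rough illustration of how the listed signals might combine, here's a hypothetical heuristic (the weights and thresholds are mine, not WebPulse's):

```python
def quality_score(text: str, html_len: int, headings: int, links: int) -> float:
    """Hypothetical heuristic combining the signals named above: content
    length, text-to-HTML ratio, heading structure, and link density.
    Illustrative weights only; not WebPulse's actual formula."""
    words = text.split()
    length_sig = min(len(words) / 300, 1.0)             # enough content?
    ratio_sig = min(len(text) / max(html_len, 1), 1.0)  # text vs. markup
    heading_sig = 1.0 if headings >= 1 else 0.5         # some structure?
    link_density = links / max(len(words), 1)
    link_sig = 1.0 if link_density < 0.1 else 0.3       # not a link farm
    score = 0.4 * length_sig + 0.2 * ratio_sig + 0.2 * heading_sig + 0.2 * link_sig
    return round(score, 2)

# A 400-word article with headings and few links scores high:
print(quality_score("word " * 400, html_len=4000, headings=3, links=5))  # → 0.9
```

A heuristic like this is cheap to compute per page, which is what makes per-request scoring feasible.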

Pricing

| Plan | Price | Requests/mo | Rate Limit | Best For |
|---|---|---|---|---|
| Basic | Free | 50 | 5/min | Testing & evaluation |
| Pro | $5/mo | 1,000 | 30/min | Individual developers |
| Ultra | $19/mo | 5,000 | 60/min | Startups & small teams |
| Mega | $49/mo | 20,000 | 100/min | Production workloads |

👉 Try it free on RapidAPI


FAQ

Q: What sites does it support?
95%+ of public websites. JavaScript SPAs, news sites, blogs, documentation, wikis — all work. Login-walled and paywalled sites will return limited content (as expected).

Q: How fast is it?
Cached responses: instant. Fresh scrapes: 4-10 seconds depending on page complexity. The /convert endpoint (raw HTML input) is sub-second.

Q: Do you respect robots.txt?
Yes. WebPulse is designed for legitimate use cases like research, RAG, and content analysis.

Q: Can I use it with LangChain / LlamaIndex?
Absolutely. Just call the API in a custom loader and return the markdown + metadata.
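A custom loader can be as small as mapping responses onto documents. A dependency-free sketch (plain dicts stand in for langchain_core.documents.Document, and the responses list stands in for whatever your /scrape wrapper returns):

```python
def to_documents(responses):
    """Map WebPulse responses to LangChain-style documents.
    A real loader would return langchain_core.documents.Document objects;
    plain dicts are used here so the sketch has no dependencies."""
    docs = []
    for data in responses:
        if not data.get("success") or data.get("quality_score", 0) < 0.4:
            continue  # skip failed or low-quality scrapes
        docs.append({
            "page_content": data["markdown"],
            "metadata": {
                **(data.get("metadata") or {}),
                "source": data.get("url"),
                "quality_score": data.get("quality_score"),
            },
        })
    return docs

sample = {
    "success": True,
    "url": "https://example.com/blog-post",
    "quality_score": 0.9,
    "markdown": "# Hello\n\nBody text.",
    "metadata": {"title": "Hello", "language": "en"},
}
docs = to_documents([sample])
```

Keeping the quality-score filter inside the loader means low-value pages never reach your splitter or vector store.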


What's Next

  • AI Summary endpoint — get a 3-sentence summary powered by LLMs (coming soon)
  • Batch scraping — submit multiple URLs in one call
  • Screenshot capture — get a visual snapshot alongside the markdown
  • Webhook support — async scraping with callback

Try It Now

The free tier gives you 50 requests/month — enough to test it in your pipeline and see if it fits.

👉 WebPulse on RapidAPI — Start Free

Built by Comall Agency. Questions? Drop a comment below or reach out on the RapidAPI community page.


If this helped you, consider leaving a ⭐ reaction — it helps other developers find this article!
