Comall Agency
I Built a URL-to-Markdown API for LLMs — Here's Why Existing Tools Fell Short

The Problem Every AI Developer Hits

You're building a RAG pipeline, an AI agent, or a chatbot that needs to read web pages. You write a quick requests.get(), parse the HTML, and... get a mess of navigation bars, cookie banners, ads, and broken formatting.

Sound familiar?

I hit this wall while building AgentIndex, an open registry for AI agents. My crawlers needed to extract clean, structured content from 25,000+ web pages — not raw HTML soup.

Here's what I tried and why each failed:

❌ httpx + BeautifulSoup

  • Result: 15% success rate on real-world URLs
  • Why: JavaScript-heavy sites return empty <body> tags. SPAs render nothing server-side.

❌ Basic Playwright

  • Result: 40% success rate
  • Why: Pages load async content after DOMContentLoaded. Without waiting for network idle + scroll simulation, you miss half the content.

❌ Existing solutions

  • Firecrawl: Great product, but not on RapidAPI. Credit multipliers make real cost 5-9x the advertised price.
  • Jina Reader: Free with rate limits, no AI summary, no quality score.
  • html-to-markdown converters on RapidAPI: They convert HTML you already have. They don't fetch anything.

There was literally no URL → clean Markdown API on RapidAPI's marketplace of 4M+ developers.

So I built one.


Introducing WebPulse

WebPulse converts any URL into clean, LLM-ready Markdown in one API call. It returns:

  • Structured Markdown — cleaned of nav, ads, footers, scripts
  • Metadata — title, author, publish date, language, detected automatically
  • Quality Score (0-1) — so your pipeline knows if the content is usable
  • LLM-Ready Context Block — pre-formatted with SOURCE, TITLE, DATE, LANG, token count
  • Word count & reading time — useful for token budget estimation

What Makes It Different

| Feature | WebPulse | Firecrawl | Jina Reader |
|---|---|---|---|
| On RapidAPI | ✅ | ❌ | ❌ |
| Headless browser | ✅ | ✅ | ✅ |
| Quality score | ✅ | ❌ | ❌ |
| LLM context block | ✅ | ❌ | ❌ |
| Free tier | ✅ 50 req/mo | — | ✅ (rate limited) |
| JS-heavy sites | ✅ 95%+ | ✅ | — |

Quick Start (2 Minutes)

1. Subscribe on RapidAPI

👉 WebPulse on RapidAPI

Free plan: 50 requests/month. No credit card required.

2. Make Your First Call

Python:

import requests

url = "https://webpulse-url-to-markdown-for-llms.p.rapidapi.com/scrape"

payload = {"url": "https://example.com/blog-post"}
headers = {
    "Content-Type": "application/json",
    "X-RapidAPI-Key": "YOUR_API_KEY",
    "X-RapidAPI-Host": "webpulse-url-to-markdown-for-llms.p.rapidapi.com"
}

response = requests.post(url, json=payload, headers=headers)
data = response.json()

print(data["markdown"])        # Clean markdown content
print(data["quality_score"])   # 0.0 to 1.0
print(data["metadata"]["title"])

JavaScript:

const response = await fetch(
  "https://webpulse-url-to-markdown-for-llms.p.rapidapi.com/scrape",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-RapidAPI-Key": "YOUR_API_KEY",
      "X-RapidAPI-Host": "webpulse-url-to-markdown-for-llms.p.rapidapi.com"
    },
    body: JSON.stringify({ url: "https://example.com/blog-post" })
  }
);

const data = await response.json();
console.log(data.markdown);
console.log(data.quality_score);

cURL:

curl -X POST \
  "https://webpulse-url-to-markdown-for-llms.p.rapidapi.com/scrape" \
  -H "Content-Type: application/json" \
  -H "X-RapidAPI-Key: YOUR_API_KEY" \
  -H "X-RapidAPI-Host: webpulse-url-to-markdown-for-llms.p.rapidapi.com" \
  -d '{"url": "https://en.wikipedia.org/wiki/Artificial_intelligence"}'

Real-World Response Example

Here's what you get when scraping a Wikipedia article:

{
  "success": true,
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "method_used": "playwright_readability",
  "processing_time_ms": 9500,
  "quality_score": 0.9,
  "markdown": "# Artificial intelligence\n\nArtificial intelligence (AI) is the capability of computational systems...",
  "metadata": {
    "title": "Artificial intelligence - Wikipedia",
    "author": null,
    "published_date": null,
    "language": "en",
    "word_count": 29631,
    "reading_time_min": 148
  },
  "llm_ready": {
    "context_block": "SOURCE: en.wikipedia.org | TITLE: Artificial intelligence - Wikipedia | LANG: en\n\nArtificial intelligence (AI) is...",
    "token_count": 3980
  }
}

The quality_score of 0.9 tells your pipeline: "this content is clean and usable." Anything below 0.4? Your code can skip it automatically.


3 Endpoints, 3 Use Cases

POST /scrape — The Main Endpoint

Give it a URL, get back clean Markdown. It uses a Playwright headless browser with smart fallbacks:

  1. First tries readability algorithm (fast)
  2. Falls back to js_extract for stubborn sites
  3. Scrolls the page to trigger lazy-loaded content
  4. Waits for network idle before extraction

POST /convert — HTML You Already Have

Already fetched the HTML yourself? Send it directly. No browser needed, instant conversion. Free on all plans.
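If you already have the HTML, a call to /convert might look like the sketch below. The "html" payload field name is my assumption, mirroring the /scrape examples; check the exact schema on the RapidAPI listing before relying on it.

```python
# Hedged sketch of POST /convert for HTML you already have.
# The "html" field name is an assumption; verify it in the API docs.
CONVERT_URL = "https://webpulse-url-to-markdown-for-llms.p.rapidapi.com/convert"

payload = {"html": "<article><h1>Hello</h1><p>Already-fetched content.</p></article>"}
headers = {
    "Content-Type": "application/json",
    "X-RapidAPI-Key": "YOUR_API_KEY",
    "X-RapidAPI-Host": "webpulse-url-to-markdown-for-llms.p.rapidapi.com",
}

# With a real key:
# import requests
# data = requests.post(CONVERT_URL, json=payload, headers=headers).json()
# print(data["markdown"])
```

Since no browser is involved, this is the endpoint to reach for when you already run your own fetching layer.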

GET /health — System Status

Check if the browser pool is warm, cache is connected, and the system is operational.


Use Cases for Developers

🔗 RAG Pipelines

Feed web content into your vector database. The quality score filters out low-value pages automatically.

data = webpulse_scrape(url)
if data["quality_score"] >= 0.6:
    chunks = split_into_chunks(data["markdown"])
    vector_db.upsert(chunks, metadata=data["metadata"])

🤖 AI Agents

Let your agent browse the web and read pages. The LLM-ready context block is pre-formatted for injection into prompts.

context = data["llm_ready"]["context_block"]
prompt = f"Based on this source:\n{context}\n\nAnswer: {user_question}"

📊 Content Monitoring

Track changes on competitor pages, news sites, or documentation. Compare Markdown diffs over time.
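Comparing two scrapes of the same page is a stdlib one-liner once you have Markdown. A minimal sketch (the two snapshot strings are hypothetical stand-ins for stored scrape results):

```python
import difflib

# Two hypothetical markdown snapshots of the same page, scraped a day apart.
yesterday = "# Pricing\n\nPro plan: $5/mo\n"
today = "# Pricing\n\nPro plan: $7/mo\n"

diff = list(difflib.unified_diff(
    yesterday.splitlines(), today.splitlines(),
    fromfile="yesterday.md", tofile="today.md", lineterm="",
))
# Flag a change if any real +/- line appears (ignoring the file headers).
changed = any(
    line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    for line in diff
)
print("\n".join(diff))
```

Markdown diffs are far more readable than HTML diffs, which is what makes this workflow practical.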

🔍 Search Engine for LLMs

Build a search tool that fetches and summarizes web results in real-time.
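One way to wire that up: scrape each result URL and pack the pre-formatted context blocks into a prompt until a token budget is hit. A hypothetical sketch (`scrape` stands in for your wrapper around POST /scrape; the budget logic is mine):

```python
def build_context(urls, scrape, max_tokens=3000):
    """Assemble an LLM context from several scraped pages, stopping at a
    token budget. Uses the context_block and token_count fields from the
    llm_ready block of each response."""
    parts, used = [], 0
    for url in urls:
        data = scrape(url)
        tokens = data["llm_ready"]["token_count"]
        if used + tokens > max_tokens:
            break  # budget exhausted; drop remaining results
        parts.append(data["llm_ready"]["context_block"])
        used += tokens
    return "\n\n---\n\n".join(parts)

# Stub scraper for demonstration (real code would call the API):
fake = lambda url: {"llm_ready": {"context_block": f"SOURCE: {url}", "token_count": 1000}}
context = build_context(
    ["https://a.example", "https://b.example",
     "https://c.example", "https://d.example"],
    fake,
)
```

With a 3,000-token budget and 1,000-token pages, the stub keeps the first three sources and drops the fourth.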


How It Works Under the Hood

WebPulse runs a Playwright browser pool on a dedicated server. Here's the pipeline:

URL → Validate → Browser Pool → Page Load (networkidle)
    → Scroll Simulation → Content Extraction (readability + JS fallback)
    → Noise Removal → Markdown Conversion → Metadata Extraction
    → Quality Scoring → LLM Context Block → Cache → Response

Key technical decisions:

  • Browser pool: 3 persistent Chromium instances, recycled every 50 pages to prevent memory leaks
  • Smart waiting: networkidle + 2s scroll delay catches 95%+ of JS-rendered content
  • Dual extraction: readability algorithm first; the document.body.innerText fallback recovered 8 additional sites out of 20 in testing
  • Redis caching: Repeated URLs return instantly from cache
  • Quality scoring: Based on content length, text-to-HTML ratio, heading structure, and link density
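The exact scoring formula isn't published; as a rough illustration of how the listed signals might combine, here's a hypothetical heuristic (the weights and thresholds are mine, not WebPulse's):

```python
def quality_score(text: str, html_len: int, headings: int, links: int) -> float:
    """Hypothetical heuristic combining the signals named above: content
    length, text-to-HTML ratio, heading structure, and link density.
    Illustrative weights only; not WebPulse's actual formula."""
    words = text.split()
    length_sig = min(len(words) / 300, 1.0)             # enough content?
    ratio_sig = min(len(text) / max(html_len, 1), 1.0)  # text vs. markup
    heading_sig = 1.0 if headings >= 1 else 0.5         # some structure?
    link_density = links / max(len(words), 1)
    link_sig = 1.0 if link_density < 0.1 else 0.3       # not a link farm
    score = 0.4 * length_sig + 0.2 * ratio_sig + 0.2 * heading_sig + 0.2 * link_sig
    return round(score, 2)

# A 400-word article with headings and few links scores high:
print(quality_score("word " * 400, html_len=4000, headings=3, links=5))  # → 0.9
```

A heuristic like this is cheap to compute per page, which is what makes per-request scoring feasible.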

Pricing

| Plan | Price | Requests/mo | Rate Limit | Best For |
|---|---|---|---|---|
| Basic | Free | 50 | 5/min | Testing & evaluation |
| Pro | $5/mo | 1,000 | 30/min | Individual developers |
| Ultra | $19/mo | 5,000 | 60/min | Startups & small teams |
| Mega | $49/mo | 20,000 | 100/min | Production workloads |

👉 Try it free on RapidAPI


FAQ

Q: What sites does it support?
95%+ of public websites. JavaScript SPAs, news sites, blogs, documentation, wikis — all work. Login-walled and paywalled sites will return limited content (as expected).

Q: How fast is it?
Cached responses: instant. Fresh scrapes: 4-10 seconds depending on page complexity. The /convert endpoint (raw HTML input) is sub-second.

Q: Do you respect robots.txt?
Yes. WebPulse is designed for legitimate use cases like research, RAG, and content analysis.

Q: Can I use it with LangChain / LlamaIndex?
Absolutely. Just call the API in a custom loader and return the markdown + metadata.
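A custom loader can be as small as mapping responses onto documents. A dependency-free sketch (plain dicts stand in for langchain_core.documents.Document, and the responses list stands in for whatever your /scrape wrapper returns):

```python
def to_documents(responses):
    """Map WebPulse responses to LangChain-style documents.
    A real loader would return langchain_core.documents.Document objects;
    plain dicts are used here so the sketch has no dependencies."""
    docs = []
    for data in responses:
        if not data.get("success") or data.get("quality_score", 0) < 0.4:
            continue  # skip failed or low-quality scrapes
        docs.append({
            "page_content": data["markdown"],
            "metadata": {
                **(data.get("metadata") or {}),
                "source": data.get("url"),
                "quality_score": data.get("quality_score"),
            },
        })
    return docs

sample = {
    "success": True,
    "url": "https://example.com/blog-post",
    "quality_score": 0.9,
    "markdown": "# Hello\n\nBody text.",
    "metadata": {"title": "Hello", "language": "en"},
}
docs = to_documents([sample])
```

Keeping the quality-score filter inside the loader means low-value pages never reach your splitter or vector store.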


What's Next

  • AI Summary endpoint — get a 3-sentence summary powered by LLMs (coming soon)
  • Batch scraping — submit multiple URLs in one call
  • Screenshot capture — get a visual snapshot alongside the markdown
  • Webhook support — async scraping with callback

Try It Now

The free tier gives you 50 requests/month — enough to test it in your pipeline and see if it fits.

👉 WebPulse on RapidAPI — Start Free

Built by Comall Agency. Questions? Drop a comment below or reach out on the RapidAPI community page.


If this helped you, consider leaving a ⭐ reaction — it helps other developers find this article!
