Building a RAG (Retrieval-Augmented Generation) pipeline usually starts with a simple goal: "I want to chat with this documentation."
You write a quick script to scrape the site, feed it into your vector database, and... the results are garbage. 🗑️
The Problem with Generic Scraping
If you simply curl a documentation page or use a generic crawler, your LLM context gets flooded with noise:
- Navigation menus repeated on every single page ("Home > Docs > API...").
- Sidebars that confuse semantic search.
- Footers, cookie banners, and scripts.
- Broken code blocks that lose their language tags.
Your retrieval system ends up matching the "Terms of Service" link in the footer instead of the actual API method you were looking for.
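If you've never watched it happen, here's roughly what the naive approach looks like (the URL is a placeholder; any documentation page shows the same problem):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; swap in any documentation page.
html = requests.get("https://docs.example.com/api/getting-started").text

# The "just grab all the text" approach:
text = BeautifulSoup(html, "html.parser").get_text(separator="\n")

# What comes back: nav links, sidebar entries, cookie banner and footer
# interleaved with the actual content, and code blocks stripped of their
# language tags. All of that noise ends up in your embeddings.
print(text[:500])
```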
The Solution: A Framework-Aware Scraper
I built Tech Docs to LLM-Ready Markdown to solve this exact problem.
Instead of treating every page as a bag of HTML tags, this actor detects the documentation framework (Docusaurus, GitBook, MkDocs, etc.) and intelligently extracts only the content you care about.
🚀 Key Features for RAG Pipelines
Here is why this is better than writing your own BeautifulSoup script:
1. Smart Framework Detection
It automatically identifies the underlying tech stack and applies specialized extraction rules (a rough sketch of the idea follows the list below).
- ✅ Docusaurus
- ✅ GitBook
- ✅ MkDocs (Material)
- ✅ ReadTheDocs
- ✅ VuePress / Nextra
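The actor's detection internals aren't published, but the core idea is simple: look for framework fingerprints (like the `<meta name="generator">` tag), then apply that framework's content selector. Here's a minimal Python sketch of the concept; the selectors and fallbacks are illustrative assumptions, not the actor's actual rules:

```python
from bs4 import BeautifulSoup

# Illustrative framework -> main-content selector map.
# These selectors are assumptions for this sketch, not the actor's real rules.
CONTENT_SELECTORS = {
    "docusaurus": "article",
    "mkdocs": "article",
    "generic": "main, article, body",
}

def detect_framework(html: str) -> str:
    """Guess the docs framework from the generator meta tag."""
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "generator"})
    hint = (meta.get("content") or "").lower() if meta else ""
    for name in ("docusaurus", "mkdocs", "gitbook", "vuepress"):
        if name in hint:
            return name
    return "generic"

def extract_main_content(html: str) -> str:
    """Pull only the main documentation body, based on the detected framework."""
    soup = BeautifulSoup(html, "html.parser")
    framework = detect_framework(html)
    selector = CONTENT_SELECTORS.get(framework, CONTENT_SELECTORS["generic"])
    node = soup.select_one(selector)
    return node.get_text(separator="\n") if node else ""
```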
2. Auto-Cleaning
It automatically strips out the following (a cleaning sketch follows the list):
- Sidebars & Top Navigation
- "Edit this page" links
- Table of Contents (redundant for embeddings)
- Footers & Legal text
3. RAG-First Output Format 🤖
This is the game-changer. The scraper doesn't just output text; it outputs structured data designed for vector DBs:
- `doc_id`: A stable, unique hash of the URL (great for deduplication; see the sketch below).
- `section_path`: The breadcrumb path (e.g., `Guides > Advanced > Configuration`). Essential for filtering retrieval results!
- `chunk_index`: Built-in chunking support, so you don't have to re-chunk huge pages later.
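The exact hashing and chunking scheme isn't documented, so treat this as a sketch of the idea rather than a spec: the SHA-256-truncated-to-16-hex-chars choice below is my assumption (it just matches the ID length in the example output):

```python
import hashlib

def make_doc_id(url: str) -> str:
    """Stable page ID: SHA-256 of the normalized URL, truncated to 16 hex chars (assumed scheme)."""
    normalized = url.rstrip("/").lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

def make_chunk_id(doc_id: str, chunk_index: int) -> str:
    """Stable per-chunk ID, handy as a vector-store upsert key."""
    return f"{doc_id}-{chunk_index}"

print(make_chunk_id(make_doc_id("https://crawlee.dev/docs/introduction"), 0))
```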
Example Output:
```json
{
  "doc_id": "acdb145c14f4310b",
  "title": "Introduction | Crawlee",
  "section_path": "Guides > Quick Start > Introduction",
  "content": "# Introduction\n\nCrawlee covers your crawling...",
  "framework": "docusaurus",
  "metadata": {
    "wordCount": 358,
    "crawledAt": "2025-12-12T03:34:46.151Z"
  }
}
```
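Want to sanity-check the output before wiring anything up? You can pull the raw items with the Apify Python client (the token and dataset ID below are placeholders):

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # placeholder token

# Fetch all items from the run's dataset.
items = client.dataset("YOUR_DATASET_ID").list_items().items

for item in items:
    print(item["doc_id"], item["section_path"], item["metadata"]["wordCount"])
```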
🛠️ Integration with LangChain
Since the output is structured, loading it into LangChain is trivial using the ApifyDatasetLoader.
```python
from langchain.document_loaders import ApifyDatasetLoader
from langchain.docstore.document import Document

# Load results from the Apify Dataset
loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"],
        metadata={
            "source": item["url"],
            "title": item["title"],
            "doc_id": item["doc_id"],
            "section": item["section_path"],  # <--- filter by section later!
        },
    ),
)

docs = loader.load()

# Your docs are now ready for embeddings!
print(f"Loaded {len(docs)} clean documents.")
```
📉 Cost & Performance
The actor uses a custom lightweight extraction engine (on top of Cheerio), so it's extremely fast and cheap.
- Pricing: Pay-per-result ($0.50 per 1,000 pages).
- Speed: Can process hundreds of pages per minute.
Try it out
If you are building an AI assistant for a library, SDK, or internal docs, give it a shot. It saves hours of data cleaning time.
Let me know in the comments if there are other documentation frameworks you'd like me to add! 👇