Дмитрий


Build Better RAG Pipelines: Scraping Technical Docs to Clean Markdown

Building a RAG (Retrieval-Augmented Generation) pipeline usually starts with a simple goal: "I want to chat with this documentation."

You write a quick script to scrape the site, feed it into your vector database, and... the results are garbage. 🗑️

The Problem with Generic Scraping

If you simply curl a documentation page or use a generic crawler, your LLM context gets flooded with noise:

  • Navigation menus repeated on every single page ("Home > Docs > API...").
  • Sidebars that confuse semantic search.
  • Footers, cookie banners, and scripts.
  • Broken code blocks that lose their language tags.

Your retrieval system ends up matching the "Terms of Service" link in the footer instead of the actual API method you were looking for.

The Solution: A Framework-Aware Scraper

I built Tech Docs to LLM-Ready Markdown to solve this exact problem.

Instead of treating every page as a bag of HTML tags, this actor detects the documentation framework (Docusaurus, GitBook, MkDocs, etc.) and intelligently extracts only the content you care about.

Tech Docs to Markdown for RAG & LLM · Apify

Scrape technical documentation from Docusaurus, GitBook, and MkDocs, and convert it to clean Markdown for RAG pipelines and AI training.


🚀 Key Features for RAG Pipelines

Here is why this is better than writing your own BeautifulSoup script:

1. Smart Framework Detection

It automatically identifies the underlying tech stack and applies specialized extraction rules.

  • Docusaurus
  • GitBook
  • MkDocs (Material)
  • ReadTheDocs
  • VuePress / Nextra
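
How does detection work? As a rough illustration only (this is not the actor's actual code, and the selectors are my own assumptions), you can often tell frameworks apart by the generator meta tag and a few well-known CSS hooks:

from bs4 import BeautifulSoup

# Illustrative signals only; the actor's real heuristics are more thorough.
SIGNALS = {
    "docusaurus": ['meta[name="generator"][content*="Docusaurus"]', ".theme-doc-markdown"],
    "mkdocs-material": ['meta[name="generator"][content*="mkdocs"]', ".md-content"],
    "gitbook": ['meta[name="generator"][content*="GitBook"]'],
    "readthedocs": [".rst-content"],
}

def detect_framework(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for framework, selectors in SIGNALS.items():
        # The first framework whose tell-tale signal appears wins.
        if any(soup.select_one(sel) for sel in selectors):
            return framework
    return "generic"

Once the framework is known, the scraper can target the exact container that holds the article body instead of guessing.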

2. Auto-Cleaning

It automatically strips out:

  • Sidebars & Top Navigation
  • "Edit this page" links
  • Table of Contents (redundant for embeddings)
  • Footers & Legal text
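
For intuition, here is roughly what that cleaning step would look like if you did it by hand with BeautifulSoup. The selectors below are illustrative assumptions, not the actor's actual rules:

from bs4 import BeautifulSoup

# Common page "chrome"; the real rules are framework-specific.
NOISE_SELECTORS = [
    "nav", "header", "footer", "aside",
    ".sidebar", ".table-of-contents", ".theme-edit-this-page",
]

def strip_chrome(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for selector in NOISE_SELECTORS:
        for node in soup.select(selector):
            node.decompose()  # drop the node and everything inside it
    return str(soup)  # what remains is (roughly) just the article body

The actor applies the framework-specific version of this automatically, so you never see the noise in your dataset.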

3. RAG-First Output Format 🤖

This is the game-changer. The scraper doesn't just output text; it outputs structured data designed for vector DBs:

  • doc_id: A stable, unique hash of the URL (great for deduplication).
  • section_path: The breadcrumb path (e.g., Guides > Advanced > Configuration). Essential for filtering retrieval results!
  • chunk_index: Built-in chunking support so you don't have to re-chunk huge pages later.

Example Output:

{
    "doc_id": "acdb145c14f4310b",
    "title": "Introduction | Crawlee",
    "section_path": "Guides > Quick Start > Introduction",
    "content": "# Introduction\n\nCrawlee covers your crawling...",
    "framework": "docusaurus",
    "metadata": {
        "wordCount": 358,
        "crawledAt": "2025-12-12T03:34:46.151Z"
    }
}
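
Because doc_id is stable across runs, deduplicating the results of repeated crawls stays trivial. A minimal sketch, assuming items is a list of dicts shaped like the example above:

# Keep only the most recently crawled version of each page.
# ISO 8601 timestamps sort correctly as plain strings.
latest = {}
for item in sorted(items, key=lambda i: i["metadata"]["crawledAt"]):
    latest[item["doc_id"]] = item

unique_pages = list(latest.values())
print(f"{len(unique_pages)} unique pages after deduplication")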

🛠️ Integration with LangChain

Since the output is structured, loading it into LangChain is trivial using the ApifyDatasetLoader.

from langchain.document_loaders import ApifyDatasetLoader
from langchain.docstore.document import Document

# Load results from Apify Dataset
loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"],
        metadata={
            "source": item["url"],
            "title": item["title"],
            "doc_id": item["doc_id"],
            "section": item["section_path"] # <--- Filter by section later!
        }
    ),
)
docs = loader.load()

# Your docs are now ready for embeddings!
print(f"Loaded {len(docs)} clean documents.")

📉 Cost & Performance

The actor uses a custom lightweight extraction engine (on top of Cheerio), so it's extremely fast and cheap.

  • Pricing: Pay-per-result ($0.50 per 1,000 pages).
  • Speed: Can process hundreds of pages per minute.

Try it out

If you are building an AI assistant for a library, SDK, or internal docs, give it a shot. It saves hours of data cleaning time.
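
You can run it from the Apify Console, or drive it from Python with the official apify-client. In the sketch below, the actor ID and the input fields are placeholders and assumptions on my part (check the actor's page for its exact input schema):

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Placeholder actor ID and input; see the actor's page for the real schema.
run = client.actor("USERNAME/tech-docs-to-llm-ready-markdown").call(
    run_input={"startUrls": [{"url": "https://crawlee.dev/docs"}]},
)

# Stream the clean Markdown items straight from the run's dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["doc_id"], item["title"])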

Try Tech Docs Scraper

Let me know in the comments if there are other documentation frameworks you'd like me to add! 👇
