<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Дмитрий</title>
    <description>The latest articles on DEV Community by Дмитрий (@_c6b40244e4cdb3f10).</description>
    <link>https://dev.to/_c6b40244e4cdb3f10</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3658068%2F38598a29-856f-4798-9bfd-9ad90f0b63e9.png</url>
      <title>DEV Community: Дмитрий</title>
      <link>https://dev.to/_c6b40244e4cdb3f10</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/_c6b40244e4cdb3f10"/>
    <language>en</language>
    <item>
      <title>Stop feeding garbage to your LLM: How to get clean Markdown from Documentation</title>
      <dc:creator>Дмитрий</dc:creator>
      <pubDate>Sat, 13 Dec 2025 14:09:02 +0000</pubDate>
      <link>https://dev.to/_c6b40244e4cdb3f10/stop-feeding-garbage-to-your-llm-how-to-get-clean-markdown-from-documentation-1l69</link>
      <guid>https://dev.to/_c6b40244e4cdb3f10/stop-feeding-garbage-to-your-llm-how-to-get-clean-markdown-from-documentation-1l69</guid>
      <description>&lt;p&gt;Building a RAG (Retrieval-Augmented Generation) pipeline sounds easy until you hit the data ingestion step.&lt;/p&gt;

&lt;p&gt;If you are trying to build a "Chat with Docs" app for a modern framework (like Next.js, Stripe, or Supabase), you know the pain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Hydration issues:&lt;/strong&gt; A plain &lt;code&gt;fetch&lt;/code&gt; or &lt;code&gt;BeautifulSoup&lt;/code&gt; request returns an empty &lt;code&gt;div&lt;/code&gt; because the content is rendered client-side by JavaScript.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Noise:&lt;/strong&gt; You scrape the content, but you also get the navbar, the footer, the "Copyright 2025" line, and the "Sign Up" button. All of this junk wastes context-window tokens.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Broken formatting:&lt;/strong&gt; Code blocks lose their structure, and tables turn into a mess.&lt;/li&gt;
&lt;/ol&gt;
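The hydration problem is easy to demonstrate. The snippet below uses a made-up static response (not any real site's HTML) to show why stripping tags from a non-hydrated page gets you nothing:

```python
import re

# A made-up stand-in for what a JS-rendered docs site serves to a plain
# HTTP client: an empty root element plus a script bundle.
static_html = """<html><body>
  <div id="root"></div>
  <script src="/static/js/main.js"></script>
</body></html>"""

# Naively stripping tags from the static response yields nothing useful,
# because the real content only appears after the page hydrates in a browser.
text = re.sub(r"<[^>]+>", "", static_html).strip()
print(repr(text))  # '' -- the docs text simply isn't in the response
```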

&lt;h3&gt;The Solution&lt;/h3&gt;

&lt;p&gt;I got tired of fixing these issues manually for every project, so I built a specialized Apify Actor designed for RAG pipelines.&lt;/p&gt;

&lt;p&gt;It does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Uses a headless browser&lt;/strong&gt; to wait for the page to fully hydrate.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Smart extraction:&lt;/strong&gt; It identifies the main content area (&lt;code&gt;&amp;lt;article&amp;gt;&lt;/code&gt;, &lt;code&gt;main&lt;/code&gt;, etc.) and strips away the UI noise.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Markdown conversion:&lt;/strong&gt; It turns the HTML into clean Markdown, preserving code blocks and tables.&lt;/li&gt;
&lt;/ol&gt;
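Steps 2 and 3 can be sketched with Python's standard library. The actor's actual implementation isn't public, so the selectors and Markdown rules below are illustrative assumptions:

```python
from html.parser import HTMLParser

class DocsToMarkdown(HTMLParser):
    """Keep only <main>/<article> content and emit simple Markdown.
    The selectors and Markdown rules here are illustrative assumptions."""
    CONTENT = {"main", "article"}
    NOISE = {"nav", "footer", "aside", "script", "style"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.in_content = 0  # nesting depth inside content containers
        self.in_noise = 0    # nesting depth inside UI noise
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTENT:
            self.in_content += 1
        elif tag in self.NOISE:
            self.in_noise += 1
        elif self.in_content and not self.in_noise:
            if tag in self.HEADINGS:
                self.out.append("\n" + self.HEADINGS[tag])
            elif tag == "pre":
                self.out.append("\n```\n")
            elif tag == "p":
                self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in self.CONTENT:
            self.in_content -= 1
        elif tag in self.NOISE:
            self.in_noise -= 1
        elif self.in_content and not self.in_noise:
            if tag == "pre":
                self.out.append("\n```\n")
            elif tag in self.HEADINGS or tag == "p":
                self.out.append("\n")

    def handle_data(self, data):
        if self.in_content and not self.in_noise:
            self.out.append(data)

    def markdown(self):
        return "".join(self.out).strip()

page = """<nav>Home | Docs | Sign Up</nav>
<main><h1>Quick Start</h1><p>Install the SDK.</p>
<pre>npm install crawlee</pre></main>
<footer>Copyright 2025</footer>"""

converter = DocsToMarkdown()
converter.feed(page)
print(converter.markdown())
```

The navbar and copyright line never reach the output, while the heading and code block survive as Markdown.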

&lt;h3&gt;How to use it&lt;/h3&gt;

&lt;p&gt;You can try it for free on Apify. Just plug in the URL of the documentation (e.g., &lt;code&gt;https://docs.stripe.com/&lt;/code&gt;) and you get JSON/Markdown output ready for your vector database.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Link to the tool:&lt;/strong&gt; &lt;a href="https://apify.com/hedelka/tech-docs-scraper" rel="noopener noreferrer"&gt;https://apify.com/hedelka/tech-docs-scraper&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm currently using it to feed Pinecone for my personal projects. Let me know if it helps with your data ingestion layer!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webscraping</category>
      <category>rag</category>
      <category>python</category>
    </item>
    <item>
      <title>Build Better RAG Pipelines: Scraping Technical Docs to Clean Markdown</title>
      <dc:creator>Дмитрий</dc:creator>
      <pubDate>Fri, 12 Dec 2025 03:56:06 +0000</pubDate>
      <link>https://dev.to/_c6b40244e4cdb3f10/build-better-rag-pipelines-scraping-technical-docs-to-clean-markdown-461b</link>
      <guid>https://dev.to/_c6b40244e4cdb3f10/build-better-rag-pipelines-scraping-technical-docs-to-clean-markdown-461b</guid>
      <description>&lt;p&gt;Building a RAG (Retrieval-Augmented Generation) pipeline usually starts with a simple goal: &lt;strong&gt;"I want to chat with this documentation."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You write a quick script to scrape the site, feed it into your vector database, and... the results are garbage. 🗑️&lt;/p&gt;

&lt;h3&gt;The Problem with Generic Scraping&lt;/h3&gt;

&lt;p&gt;If you simply &lt;code&gt;curl&lt;/code&gt; a documentation page or use a generic crawler, your LLM context gets flooded with noise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Navigation menus&lt;/strong&gt; repeated on every single page ("Home &amp;gt; Docs &amp;gt; API...").&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Sidebars&lt;/strong&gt; that confuse semantic search.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Footers, cookie banners, and scripts.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Broken code blocks&lt;/strong&gt; that lose their language tags.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your retrieval system ends up matching the "Terms of Service" link in the footer instead of the actual API method you were looking for.&lt;/p&gt;

&lt;h3&gt;The Solution: A Framework-Aware Scraper&lt;/h3&gt;

&lt;p&gt;I built &lt;a href="https://apify.com/hedelka/tech-docs-scraper" rel="noopener noreferrer"&gt;Tech Docs to LLM-Ready Markdown&lt;/a&gt; to solve this exact problem. &lt;/p&gt;

&lt;p&gt;Instead of treating every page as a bag of HTML tags, this actor &lt;strong&gt;detects the documentation framework&lt;/strong&gt; (Docusaurus, GitBook, MkDocs, etc.) and intelligently extracts &lt;em&gt;only&lt;/em&gt; the content you care about.&lt;/p&gt;





&lt;h3&gt;🚀 Key Features for RAG Pipelines&lt;/h3&gt;

&lt;p&gt;Here is why this is better than writing your own BeautifulSoup script:&lt;/p&gt;

&lt;h4&gt;1. Smart Framework Detection&lt;/h4&gt;

&lt;p&gt;It automatically identifies the underlying tech stack and applies specialized extraction rules.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  ✅ &lt;strong&gt;Docusaurus&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  ✅ &lt;strong&gt;GitBook&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  ✅ &lt;strong&gt;MkDocs&lt;/strong&gt; (Material)&lt;/li&gt;
&lt;li&gt;  ✅ &lt;strong&gt;ReadTheDocs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  ✅ &lt;strong&gt;VuePress / Nextra&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
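The detection internals aren't published, but a plausible heuristic is to look for each generator's telltale markers in the raw HTML. The signatures below are assumptions based on those frameworks' default output, not the actor's actual rules:

```python
# Telltale markers for common docs generators. These signatures are
# illustrative assumptions, not the actor's actual detection rules.
FRAMEWORK_SIGNATURES = {
    "docusaurus": ['<meta name="generator" content="docusaurus', "docusaurus"],
    "mkdocs": ['<meta name="generator" content="mkdocs', "mkdocs"],
    "gitbook": ["gitbook"],
    "readthedocs": ["readthedocs"],
    "vuepress": ["vuepress"],
}

def detect_framework(html: str) -> str:
    """Return the first framework whose marker appears in the page."""
    page = html.lower()
    for framework, markers in FRAMEWORK_SIGNATURES.items():
        if any(marker in page for marker in markers):
            return framework
    return "generic"

sample = '<html><head><meta name="generator" content="Docusaurus v3"></head></html>'
print(detect_framework(sample))  # docusaurus
```

Once the framework is known, the scraper can apply content selectors that are guaranteed to exist in that theme instead of guessing.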

&lt;h4&gt;2. Auto-Cleaning&lt;/h4&gt;

&lt;p&gt;It automatically strips out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Sidebars &amp;amp; Top Navigation&lt;/li&gt;
&lt;li&gt;  "Edit this page" links&lt;/li&gt;
&lt;li&gt;  Table of Contents (redundant for embeddings)&lt;/li&gt;
&lt;li&gt;  Footers &amp;amp; Legal text&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;3. RAG-First Output Format 🤖&lt;/h4&gt;

&lt;p&gt;This is the game-changer. The scraper doesn't just output text; it outputs structured data designed for vector DBs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;code&gt;doc_id&lt;/code&gt;&lt;/strong&gt;: A stable, unique hash of the URL (great for deduplication).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;code&gt;section_path&lt;/code&gt;&lt;/strong&gt;: The breadcrumb path (e.g., &lt;code&gt;Guides &amp;gt; Advanced &amp;gt; Configuration&lt;/code&gt;). Essential for filtering retrieval results!&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;code&gt;chunk_index&lt;/code&gt;&lt;/strong&gt;: Built-in chunking support so you don't have to re-chunk huge pages later.&lt;/li&gt;
&lt;/ul&gt;
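The exact hashing and chunking schemes aren't documented, so here is one way you could reproduce them yourself. Truncated SHA-256 is an assumption that merely matches the 16-hex-character shape of the sample `doc_id`:

```python
import hashlib

def doc_id(url: str) -> str:
    """Stable 16-hex-char id derived from the URL.
    Truncated sha256 is an assumption; it only mirrors the sample's shape."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

def chunk(markdown: str, max_words: int = 200) -> list[dict]:
    """Split a page into fixed-size word windows, tagging each with chunk_index."""
    words = markdown.split()
    return [
        {"chunk_index": i, "content": " ".join(words[start:start + max_words])}
        for i, start in enumerate(range(0, len(words), max_words))
    ]

url = "https://crawlee.dev/docs/introduction"
print(doc_id(url))  # same input -> same id, run after run
print(len(chunk("word " * 450, max_words=200)))  # 3
```

A deterministic id means re-crawling the same docs upserts into your vector DB instead of duplicating vectors.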

&lt;p&gt;&lt;strong&gt;Example Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"doc_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"acdb145c14f4310b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Introduction | Crawlee"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"section_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Guides &amp;gt; Quick Start &amp;gt; Introduction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"# Introduction&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Crawlee covers your crawling..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"framework"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"docusaurus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"wordCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;358&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"crawledAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-12-12T03:34:46.151Z"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;🛠️ Integration with LangChain&lt;/h3&gt;

&lt;p&gt;Since the output is structured, loading it into LangChain is trivial using the &lt;code&gt;ApifyDatasetLoader&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyDatasetLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.docstore.document&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt;

&lt;span class="c1"&gt;# Load results from Apify Dataset
&lt;/span&gt;&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyDatasetLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_DATASET_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataset_mapping_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;section&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;section_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# &amp;lt;--- Filter by section later!
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Your docs are now ready for embeddings!
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; clean documents.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
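The `section` metadata pays off at query time, e.g. restricting retrieval to one part of the docs. A minimal sketch over plain dicts (your vector store's filtering API will differ; this only shows the idea):

```python
# Hypothetical retrieved chunks; in practice these come from your vector store.
docs = [
    {"content": "...", "metadata": {"section": "Guides > Quick Start > Introduction"}},
    {"content": "...", "metadata": {"section": "API Reference > Crawlers"}},
    {"content": "...", "metadata": {"section": "Guides > Advanced > Configuration"}},
]

def in_section(docs, prefix):
    """Keep only chunks whose breadcrumb path starts with the given prefix."""
    return [d for d in docs if d["metadata"]["section"].startswith(prefix)]

guides = in_section(docs, "Guides")
print(len(guides))  # 2
```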



&lt;h3&gt;📉 Cost &amp;amp; Performance&lt;/h3&gt;

&lt;p&gt;The actor uses a custom lightweight extraction engine (on top of Cheerio), so it's extremely fast and cheap.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Pricing:&lt;/strong&gt; Pay-per-result ($0.50 per 1,000 pages).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Speed:&lt;/strong&gt; Can process hundreds of pages per minute.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Try it out&lt;/h3&gt;

&lt;p&gt;If you are building an AI assistant for a library, SDK, or internal docs, give it a shot. It saves hours of data cleaning time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apify.com/hedelka/tech-docs-scraper" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Try Tech Docs Scraper&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Let me know in the comments if there are other documentation frameworks you'd like me to add! 👇&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>rag</category>
      <category>ideas</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
