DEV Community

Miguel Álvarez

Mojo: A Lightweight C++ Web Crawler for Converting Websites to RAG-Ready Data (Fast, Simple, CI/CD-Friendly)

When building RAG systems or LLM-powered pipelines, you often don’t need a massive distributed crawler or a cloud scraping platform.

Most of the time, you just want to:

  • Crawl a website deeply
  • Convert pages into clean text (Markdown)
  • Feed them into embeddings or downstream processing
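Once the pages are on disk as Markdown, the usual downstream step is splitting them into chunks before embedding. Here is a minimal sketch of that step; the chunk size, overlap, and file layout are my own illustrative assumptions, not part of Mojo:

```python
from pathlib import Path

def chunk_markdown(text, max_chars=1000, overlap=100):
    """Split Markdown text into overlapping chunks for embedding.
    The defaults here are illustrative, not Mojo settings."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap keeps context across chunk boundaries
    return chunks

def load_chunks(docs_dir):
    """Read every .md file the crawler produced and yield chunk records."""
    for path in Path(docs_dir).rglob("*.md"):
        for chunk in chunk_markdown(path.read_text(encoding="utf-8")):
            yield {"source": str(path), "text": chunk}
```

From there, each record can go straight into your embedding model of choice.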

However, many existing tools introduce complexity or overhead:

Scrapy is extremely powerful and flexible, but requires writing spiders, managing Python dependencies, and building custom pipelines.

Apify offers a full scraping platform, but relies on cloud infrastructure, subscriptions, and heavier runtime environments (Node.js/Python).

Firecrawl and similar APIs are great for large-scale ingestion, but can be overkill if you want reproducible, local-first CI workflows.

That’s why I built Mojo, a lightweight, cross-platform C++ web crawler designed specifically for LLM/RAG workflows.

Why I Built Mojo

Mojo focuses on one thing: efficiently crawling websites and producing clean, structured output suitable for LLM pipelines.

Compared to Python- and Node-based crawlers, Mojo is significantly faster and lighter on CPU and RAM, making it ideal for cloud jobs, Lambdas, CI pipelines, or cheap servers.

Quick Example

Crawl an entire documentation site up to depth 2 and export everything as Markdown:

./mojo -d 2 https://docs.example.com -o ./docs

For JS-rendered websites (SPAs):

./mojo --render https://spa-example.com -o ./docs_rendered

Note: --render requires Chromium/Chrome installed on the machine.

Using proxies:

./mojo -p socks5://127.0.0.1:9050 https://target.com

Or with a proxy list:

./mojo --config example_config.yaml https://target.com
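The exact schema of `example_config.yaml` is defined by the Mojo repository; a proxy-list config might look roughly like the sketch below. The key names here are illustrative guesses, so check the project README for the real schema:

```yaml
# Hypothetical proxy-list configuration.
# Key names are illustrative, not taken from Mojo's documented schema.
proxies:
  - socks5://127.0.0.1:9050
  - http://user:pass@10.0.0.2:8080
```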

Perfect for CI/CD Pipelines

Mojo was built with automation in mind.

Example GitHub Actions workflow:

name: Generate docs with Mojo

on:
  workflow_dispatch:
  schedule:
    - cron: '0 3 * * *'

jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download Mojo
        run: |
          curl -L -o mojo https://github.com/malvads/mojo/releases/download/v0.1.0/mojo-0.1.0-linux-x86_64
          chmod +x mojo

      - name: Run crawler
        run: ./mojo -d 2 https://docs.example.com -o ./generated_docs
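To make the crawl output available to later jobs (or for download), you can publish it as a workflow artifact. An extra step like this, using the standard actions/upload-artifact action, would do it:

```yaml
      - name: Upload generated docs
        uses: actions/upload-artifact@v4
        with:
          name: generated-docs
          path: ./generated_docs
```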

When Should You Use Mojo?

Use Mojo if you:

  • Want fast website → Markdown conversion
  • Prefer local tools over cloud services
  • Care about performance and reproducibility
  • Are building RAG, search, or LLM pipelines

You might prefer heavier frameworks if you need advanced per-page scraping logic or complex data-extraction workflows.

But for most LLM ingestion use cases, Mojo keeps things simple and efficient.

Mojo is fully open source under the MIT license.

Feel free to check it out: https://github.com/malvads/mojo :)
