DEV Community

Miguel Álvarez

Mojo: A Lightweight C++ Web Crawler for Converting Websites to RAG-Ready Data (Fast, Simple, CI/CD-Friendly)

When building RAG systems or LLM-powered pipelines, you often don’t need a massive distributed crawler or a cloud scraping platform.

Most of the time, you just want to:

  • Crawl a website deeply
  • Convert pages into clean text (Markdown)
  • Feed them into embeddings or downstream processing
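Once the pages are on disk as Markdown, the usual downstream step is splitting them into chunks before embedding. Here is a minimal sketch of that step; the chunk size, overlap, and file layout are my own illustrative assumptions, not part of Mojo:

```python
from pathlib import Path

def chunk_markdown(text, max_chars=1000, overlap=100):
    """Split Markdown text into overlapping chunks for embedding.
    The defaults here are illustrative, not Mojo settings."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap keeps context across chunk boundaries
    return chunks

def load_chunks(docs_dir):
    """Read every .md file the crawler produced and yield chunk records."""
    for path in Path(docs_dir).rglob("*.md"):
        for chunk in chunk_markdown(path.read_text(encoding="utf-8")):
            yield {"source": str(path), "text": chunk}
```

From there, each record can go straight into your embedding model of choice.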

However, many existing tools introduce complexity or overhead:

Scrapy is extremely powerful and flexible, but requires writing spiders, managing Python dependencies, and building custom pipelines.

Apify offers a full scraping platform, but relies on cloud infrastructure, subscriptions, and heavier runtime environments (Node.js/Python).

Firecrawl and similar APIs are great for large-scale ingestion, but can be overkill if you want reproducible, local-first CI workflows.

That’s why I built Mojo, a lightweight, cross-platform C++ web crawler designed specifically for LLM/RAG workflows.

Why I Built Mojo

Mojo focuses on one thing: efficiently crawling websites and producing clean, structured output suitable for LLM pipelines.

Compared to Python- and Node-based crawlers, Mojo is significantly faster and lighter on CPU and RAM, making it ideal for cloud jobs, Lambdas, CI pipelines, or cheap servers.

Quick Example

Crawl an entire documentation site up to depth 2 and export everything as Markdown:

./mojo -d 2 https://docs.example.com -o ./docs

For JS-rendered websites (SPAs):

./mojo --render https://spa-example.com -o ./docs_rendered

Note: --render requires Chromium/Chrome installed on the machine.

Using proxies:

./mojo -p socks5://127.0.0.1:9050 https://target.com

Or with a proxy list:

./mojo --config example_config.yaml https://target.com
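The exact schema of `example_config.yaml` is defined by the Mojo repository; a proxy-list config might look roughly like the sketch below. The key names here are illustrative guesses, so check the project README for the real schema:

```yaml
# Hypothetical proxy-list configuration.
# Key names are illustrative, not taken from Mojo's documented schema.
proxies:
  - socks5://127.0.0.1:9050
  - http://user:pass@10.0.0.2:8080
```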

Perfect for CI/CD Pipelines

Mojo was built with automation in mind.

Example GitHub Actions workflow:

name: Generate docs with Mojo

on:
  workflow_dispatch:
  schedule:
    - cron: '0 3 * * *'

jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download Mojo
        run: |
          curl -L -o mojo https://github.com/malvads/mojo/releases/download/v0.1.0/mojo-0.1.0-linux-x86_64
          chmod +x mojo

      - name: Run crawler
        run: ./mojo -d 2 https://docs.example.com -o ./generated_docs
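To make the crawl output available to later jobs (or for download), you can publish it as a workflow artifact. An extra step like this, using the standard actions/upload-artifact action, would do it:

```yaml
      - name: Upload generated docs
        uses: actions/upload-artifact@v4
        with:
          name: generated-docs
          path: ./generated_docs
```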

When Should You Use Mojo?

Use Mojo if you:

  • Want fast website → Markdown conversion
  • Prefer local tools over cloud services
  • Care about performance and reproducibility
  • Are building RAG, search, or LLM pipelines

You might prefer heavier frameworks if you need advanced per-page scraping logic or complex data-extraction workflows.

But for most LLM ingestion use cases, Mojo keeps things simple and efficient.

Mojo is fully open source under the MIT license.

Feel free to check it out: https://github.com/malvads/mojo :)
