When building RAG systems or LLM-powered pipelines, you often don’t need a massive distributed crawler or a cloud scraping platform.
Most of the time, you just want to:
- Crawl a website deeply
- Convert pages into clean text (Markdown)
- Feed them into embeddings or downstream processing
However, many existing tools introduce complexity or overhead:
- **Scrapy** is extremely powerful and flexible, but requires writing spiders, managing Python dependencies, and building custom pipelines.
- **Apify** offers a full scraping platform, but relies on cloud infrastructure, subscriptions, and heavier runtime environments (Node.js/Python).
- **Firecrawl** and similar APIs are great for large-scale ingestion, but can be overkill if you want reproducible, local-first CI workflows.
That’s why I built Mojo, a lightweight, cross-platform C++ web crawler designed specifically for LLM/RAG workflows.
## Why I Built Mojo
Mojo focuses on one simple thing: efficiently crawling websites and producing clean, structured output suitable for LLM pipelines.
Compared to Python- and Node-based crawlers, Mojo is significantly faster and lighter on CPU/RAM, making it ideal for cloud jobs, Lambdas, CI pipelines, or cheap servers.
## Quick Example
Crawl an entire documentation site up to depth 2 and export everything as Markdown:
```bash
./mojo -d 2 https://docs.example.com -o ./docs
```
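From there, feeding the output into an embedding or indexing step is straightforward. A minimal sketch, where `ingest.py` stands in for whatever chunking/embedding script you already use (it's not part of Mojo):

```bash
# Hand the crawled Markdown directory to your own chunk/embed/index step
python ingest.py ./docs   # ingest.py is hypothetical, not part of Mojo
```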
For JS-rendered websites (SPAs):
```bash
./mojo --render https://spa-example.com -o ./docs_rendered
```
Note: `--render` requires Chromium/Chrome to be installed on the machine.
Using proxies:
```bash
./mojo -p socks5://127.0.0.1:9050 https://target.com
```
Or with a proxy list defined in a config file:
```bash
./mojo --config example_config.yaml https://target.com
```
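Roughly, a proxy-list config looks like this; the key names below are illustrative, so check `example_config.yaml` in the repo for the exact schema:

```yaml
# Illustrative only; see example_config.yaml in the repo for the real format
proxies:
  - socks5://127.0.0.1:9050
  - http://user:pass@10.0.0.2:8080
```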
## Perfect for CI/CD Pipelines
Mojo was built with automation in mind.
Example GitHub Actions workflow:
```yaml
name: Generate docs with Mojo

on:
  workflow_dispatch:
  schedule:
    - cron: '0 3 * * *'

jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download Mojo
        run: |
          curl -L -o mojo https://github.com/malvads/mojo/releases/download/v0.1.0/mojo-0.1.0-linux-x86_64
          chmod +x mojo
      - name: Run crawler
        run: ./mojo -d 2 https://docs.example.com -o ./generated_docs
```
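If you want to publish the output, you can append a step to the same job, for example uploading it as a build artifact:

```yaml
      # Appended under the existing steps: list
      - name: Upload generated docs
        uses: actions/upload-artifact@v4
        with:
          name: generated-docs
          path: ./generated_docs
```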
## When Should You Use Mojo?
Mojo is a good fit if you:
- Want fast website → Markdown conversion
- Prefer local tools over cloud services
- Care about performance and reproducibility
- Are building RAG, search, or LLM pipelines
You might prefer heavier frameworks if you need advanced per-page scraping logic or complex data-extraction workflows.
But for most LLM ingestion use cases, Mojo keeps things simple and efficient.
Mojo is fully open source under the MIT license.
Feel free to check it out: https://github.com/malvads/mojo :)