LLM applications require clean, structured input data. Web content often contains HTML markup, navigation elements, and irrelevant content that degrades LLM performance. Converting HTML to clean Markdown while extracting only relevant content remains a common challenge.
mq-crawler addresses this by combining web crawling with HTML-to-Markdown conversion and content filtering through mq queries.
Single Binary Deployment
mq-crawler (distributed as mqcr) is a standalone binary. The single executable includes the complete web crawler, HTML parser, Markdown converter, and mq-lang query processor. This eliminates dependency management and environment setup issues common with multi-component solutions.
# Download and run immediately - no installation required
curl -L https://github.com/harehare/mq/releases/latest/download/mqcr-linux-x86_64 -o mqcr
chmod +x mqcr
./mqcr https://docs.example.com
Core Functionality
mq-crawler crawls websites, converts HTML to Markdown, and processes content with mq-lang queries. The tool respects robots.txt, implements rate limiting, and supports concurrent processing.
# Basic crawling with markdown output
mqcr https://docs.example.com
# Extract specific content sections
mqcr -q '.h2 | select(contains("API"))' https://docs.example.com
# Parallel crawling with 4 workers
mqcr -c 4 -o ./output https://docs.example.com
# Crawl only the specified URL (no recursion)
mqcr --depth 0 https://docs.example.com
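When building an LLM corpus from several sites, it can help to keep each site's output separate. The following is a minimal sketch of a wrapper, not something shipped with mq-crawler: the function name `crawl_site` and the `MQCR_BIN` override are hypothetical, and it assumes that `-o` names an output directory and that `-c`, `--crawl-delay`, and `-o` can be combined in one invocation.

```shell
# Hypothetical helper, not part of mq-crawler itself.
# crawl_site URL [OUT_ROOT]: crawl URL into OUT_ROOT/<host>.
# MQCR_BIN is an assumed override for the binary name or path.
crawl_site() {
  cs_url="$1"
  cs_root="${2:-./output}"
  # Strip the scheme, then everything after the first slash, to get the host.
  cs_host="${cs_url#*://}"
  cs_host="${cs_host%%/*}"
  "${MQCR_BIN:-mqcr}" -c 4 --crawl-delay 1.0 -o "${cs_root}/${cs_host}" "$cs_url"
}

# Example usage (commented out so the snippet has no side effects):
# for site in https://docs.example.com https://tutorial.example.com; do
#   crawl_site "$site" ./corpus
# done
```

Keeping one directory per host makes it easier to re-crawl a single site later without touching the rest of the corpus.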
HTML to Markdown Conversion
The conversion process handles complex HTML structures:
Table Processing
HTML tables convert to properly formatted Markdown tables with column alignment detection:
# Input HTML
<table>
<tr><th>Method</th><th>Endpoint</th><th>Description</th></tr>
<tr><td>GET</td><td>/api/users</td><td>List users</td></tr>
</table>
# Output Markdown
| Method | Endpoint   | Description |
| ------ | ---------- | ----------- |
| GET    | /api/users | List users  |
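Before feeding converted pages to an LLM, a quick sanity check on table shape can catch conversion problems early. The sketch below is an assumption-laden helper, not an mq-crawler feature: `check_table_columns` is a hypothetical name, and it simply counts `|` separators per row, so a cell containing an escaped pipe would be miscounted.

```shell
# Hypothetical helper: succeed only if, within each Markdown table in FILE,
# every row has the same number of pipe-separated fields as the table's first row.
check_table_columns() {
  awk -F'|' '
    /^[ \t]*\|/ {
      if (expected == 0) expected = NF      # first row of a table sets the width
      else if (NF != expected) bad = 1      # later rows must match it
      next
    }
    { expected = 0 }                        # any non-table line ends the table
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

It could be run over each generated file, for example `check_table_columns ./output/page.md || echo "ragged table"`, where the path is only a placeholder.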
Concurrent Processing
The crawler supports parallel processing for efficiency:
# Process multiple pages concurrently
mqcr -c 8 --crawl-delay 1.0 https://docs.example.com
# Configure timeouts for different operations
mqcr --page-load-timeout 30 --script-timeout 10 --implicit-timeout 5 https://example.com
Timeout options control different aspects:
- --page-load-timeout: Full page loading (default: 30s)
- --script-timeout: JavaScript execution (default: 10s)
- --implicit-timeout: Element finding (default: 5s)
Content Filtering with mq Queries
mq-crawler processes content through mq-lang queries for targeted extraction:
Code Example Collection
# Extract all code examples with context
mqcr -q '
.code |
{
"language": attr("lang"),
"code": to_text(),
}
' https://tutorial.example.com
Ethical Crawling Features
robots.txt Compliance
# Respect robots.txt automatically
mqcr https://example.com
# Use custom robots.txt
mqcr --robots-path ./custom-robots.txt https://example.com
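To make the idea of robots.txt compliance concrete, here is a rough sketch of the kind of check a robots-aware crawler performs before fetching a path. It does not reflect how mq-crawler implements this. The function name `is_disallowed` is hypothetical, and the logic is deliberately simplified: it looks only at the `User-agent: *` group, treats each `Disallow` value as a plain path prefix, and ignores `Allow` rules and wildcard patterns.

```shell
# Hypothetical helper: is_disallowed ROBOTS_FILE PATH
# Exits 0 if a Disallow rule in the "User-agent: *" group is a prefix of PATH.
is_disallowed() {
  awk -v path="$2" '
    { sub(/#.*/, ""); sub(/\r$/, "") }          # drop comments and CR line endings
    tolower($0) ~ /^[ \t]*user-agent[ \t]*:/ {
      v = $0; sub(/^[^:]*:[ \t]*/, "", v); sub(/[ \t]+$/, "", v)
      if (!prev_ua) in_star = 0                 # a new group starts here
      if (v == "*") in_star = 1
      prev_ua = 1
      next
    }
    $0 !~ /^[ \t]*$/ { prev_ua = 0 }            # any rule line ends the agent list
    in_star && tolower($0) ~ /^[ \t]*disallow[ \t]*:/ {
      v = $0; sub(/^[^:]*:[ \t]*/, "", v); sub(/[ \t]+$/, "", v)
      if (v != "" && index(path, v) == 1) hit = 1
    }
    END { exit (hit ? 0 : 1) }
  ' "$1"
}
```

A check like `is_disallowed ./custom-robots.txt /private/page.html` can be handy when writing a custom robots file, to confirm that the rules cover the paths you intend to exclude.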
Rate Limiting
# Configure crawl delays
mqcr --crawl-delay 2.0 https://example.com
# Respectful concurrent crawling
mqcr -c 3 --crawl-delay 1.5 https://example.com
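Conceptually, a crawl delay is a pause between successive requests to the same site. The sketch below illustrates that idea with a plain sequential loop. It is not mq-crawler's implementation, and `polite_fetch` and `fetch_url` are hypothetical names. Fractional delays such as 1.5 rely on a `sleep` that accepts them, which GNU and BSD versions do.

```shell
# Hypothetical fetcher: print the body of one URL to stdout.
fetch_url() {
  curl -fsS "$1"
}

# Hypothetical helper: polite_fetch DELAY URL...
# Fetches each URL in order, sleeping DELAY seconds between requests.
polite_fetch() {
  pf_delay="$1"
  shift
  pf_first=1
  for pf_url in "$@"; do
    # No pause before the first request, only between requests.
    [ "$pf_first" -eq 1 ] || sleep "$pf_delay"
    pf_first=0
    fetch_url "$pf_url"
  done
}
```

With a delay of 2.0 and three URLs, the loop pauses twice, so the delay adds about four seconds on top of the fetch time.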
Installation and Setup
Package Installation
# Install via Homebrew
brew install harehare/tap/mqcr
# Download pre-built binary directly
curl -L https://github.com/harehare/mq/releases/latest/download/mqcr-linux-x86_64 -o mqcr
chmod +x mqcr
# Or build from source
cargo install --git https://github.com/harehare/mq.git mq-crawler
Results and Benefits
The mq-crawler approach provides:
- Single Binary Deployment: Immediate execution with no dependencies to install
- Clean Markdown Output: Structured content without HTML noise
- Targeted Extraction: Query-based filtering for relevant content
- Ethical Compliance: Automated robots.txt respect and rate limiting
- Scalable Processing: Concurrent crawling with configurable limits
- LLM-Ready Format: Properly formatted Markdown
This combination reduces manual preprocessing overhead while maintaining content quality for LLM applications.
Installation
For other installation methods, including Docker and pre-built binaries, check the official installation guide.
Resources
Support
- 🐛 Report bugs
- 💡 Request features
- ⭐ Star the project if you find it useful!