How I scaled a tiny tool into a full-featured, agent-friendly web access stack — and why you should try it today
Short version: Search-Scrape is a Rust-native set of MCP tools that gives AI assistants programmatic, agent-friendly web search and content scraping — 100% free, self-hosted, and privacy-first. It bundles federated SearXNG search, smart scraping, JSON output for agents, research history (Qdrant), code-block extraction, quality scoring, and more.
Why write this now
A few months ago someone published a short review of this project when it had far fewer features; since then Search-Scrape has grown substantially. This post explains what’s new, why the tool matters for AI-first workflows, and why an open, local, federated approach beats paid search APIs for many use cases, especially privacy-sensitive and agent-driven setups. (Context & older review: Skywork review.)
Elevator pitch
If you build or operate AI assistants (VS Code/Cursor/Trae or custom agents), Search-Scrape gives your agent:
- Federated, configurable web search (via SearXNG) with full engine/category/time parameters.
- Intelligent, noise-filtered scraping that preserves code blocks and returns structured JSON for programmatic consumption.
- Local research history (embeddings + Qdrant) so agents can avoid repetition and do semantic search over past work.
- No API keys, no per-request billing — run it on your infrastructure and keep your data private.
What’s new & notable (high-level features)
The project README and changelog list the load-bearing features that make this more than a “simple scraper”:
- Full SearXNG parameter support (engines, categories, safesearch, time range, pagination).
- Intelligent scraping: content extraction with automatic noise removal (ads, headers/footers, nav) and smart link filtering to capture primary-content links only.
- Agent-friendly JSON output and code-block extraction for reliable developer workflows (keeps syntax + language hints).
- Research history with semantic search (optional Qdrant integration) to store embeddings and avoid duplicate work.
- Quality scoring, classification, and machine-readable warnings to help agents rank and safely use scraped content.
(Short: it’s targeted at agent ergonomics — structured outputs, quality signals, and memory.)
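To make “agent-friendly JSON output” concrete, here is a rough sketch of the kind of structured result the scraper aims to hand back. The field names below are illustrative assumptions, not the server’s documented schema; check the README for the actual output format.

```json
{
  "url": "https://example.com/blog/async-rust",
  "title": "Async Rust patterns",
  "content": "Cleaned main-body text, with ads/nav/footers already stripped.",
  "code_blocks": [
    { "language": "rust", "code": "tokio::spawn(async move { /* ... */ });" }
  ],
  "quality_score": 0.87,
  "warnings": ["content_truncated"]
}
```

The value for an agent is that it can branch on fields like quality_score or warnings mechanically, instead of re-parsing raw HTML and guessing.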
Why federated + self-hosted search (SearXNG) matters
SearXNG is a metasearch engine that aggregates results from many upstream engines while not tracking or profiling users. That makes it an excellent backend for a privacy-first agent that needs diverse coverage (Google, DuckDuckGo, Bing, etc.) without sending your queries to a single vendor. If you self-host SearXNG, you control the instance and the logs. (docs.searxng.org)
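You can sanity-check a local SearXNG instance directly, independent of the MCP server. A minimal sketch, assuming the instance listens on port 8888 and has the json format enabled under search.formats in its settings.yml:

```bash
# Ask the local SearXNG instance for machine-readable results.
# Requires "json" to be listed under search.formats in settings.yml.
curl -s "http://localhost:8888/search?q=rust+tokio+spawn&format=json" | head -c 500
```

Each result in the payload carries the upstream engine(s) that produced it, so you can see the federated coverage in action.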
The advantages vs. paid search APIs (straight talk)
Paid search APIs and big cloud search services have their place, but they bring some consistent downsides. Here’s how Search-Scrape addresses them.
1. Cost & rate limits
Paid APIs charge per request or per token and often impose rate limits. For agent-heavy automation (hundreds–thousands of agent-initiated searches), costs scale fast. Search-Scrape is self-hosted and free to run (aside from your infra), so you can run many queries without per-request fees. (GitHub)
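For a rough illustration (vendor pricing varies, so treat these numbers as hypothetical): at $5 per 1,000 queries, an agent fleet issuing 2,000 searches a day comes to roughly 60,000 queries and $300 per month in API fees alone. The same volume against a self-hosted instance costs only whatever you pay for the box.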
2. Privacy & profiling risk
Large search providers log and use query data for personalization, ads, and product improvement. In recent years there have been high-profile privacy and legal issues tied to how large vendors collect and use user data. Running your own SearXNG-backed search avoids sending raw queries into a centralized vendor’s profiling pipeline. (AP News)
3. Vendor lock-in & control
APIs change pricing, contracts, or quotas — making reproducible agent behavior brittle. Self-hosting with open-source components (SearXNG, Qdrant, Rust server) gives you control and transparency.
4. Agent-friendliness
Paid search APIs often return opaque or HTML-heavy results. Search-Scrape focuses on structured JSON outputs, code-block preservation, and quality signals — things agents can act on immediately.
The tradeoffs (be honest)
No tool is perfect. Here are realistic constraints to consider:
- You must self-host and operate components (SearXNG, optional Qdrant). That brings ops overhead (deploy, monitor, update). The README has a Docker Compose quickstart to reduce friction.
- Maintenance & instance selection — if you rely on third-party public SearXNG instances, availability/quality varies; self-hosting is recommended for production. (Medium)
- Legal/robots considerations — scraping has legal and ethical boundaries depending on the sites you target; always respect terms of service and robots.txt (a quick check is sketched after this list). Search-Scrape helps by being content-aware, but governance is still your responsibility.
- Coverage vs. perfection — federated metasearch gives breadth, but for a small subset of specialized queries a tightly integrated vendor pipeline may still win on depth or freshness.
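On the robots.txt point: before pointing the scraper at a new site, it is worth eyeballing that site’s crawl policy yourself. A minimal sketch (example.com is a placeholder):

```bash
# Fetch a site's robots.txt before scraping it; "Disallow" lines
# list the paths the site asks automated clients to avoid.
curl -s "https://example.com/robots.txt" | grep -iE "^(user-agent|disallow|allow):"
```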
Quick-start (copy-paste friendly)
A minimal local setup (from the README):
```bash
# 1. Start SearXNG (docker-compose)
docker-compose up -d searxng

# 2. Optional: start Qdrant for research history
docker-compose up -d qdrant

# 3. Build the MCP server
cd mcp-server && cargo build --release
```

4. Add the server to your assistant's MCP config:

```json
{
  "mcpServers": {
    "search-scrape": {
      "command": "/path/to/mcp-server/target/release/search-scrape-mcp",
      "env": {
        "SEARXNG_URL": "http://localhost:8888",
        "SEARXNG_ENGINES": "google,bing,duckduckgo",
        "QDRANT_URL": "http://localhost:6334",
        "MAX_LINKS": "100"
      }
    }
  }
}
```
(See the README on GitHub for env var defaults and detailed parameter docs.)
Example workflows where Search-Scrape shines
- AI pair programmer in VS Code / Cursor: ask an assistant to "find the best example of using tokio::spawn with error handling" — the agent runs a federated search, scrapes code blocks, scores them, and returns the most relevant snippet with sources (a sketch of the underlying tool call follows this list).
- Automated research assistant: keep a Qdrant-backed research history so your assistant can recall prior searches and surface prior findings instead of repeating work.
- Privacy-sensitive deployments: internal knowledge assistants for regulated teams (legal, security, healthcare) where query telemetry must remain on-prem. (docs.searxng.org)
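For a feel of what the pair-programmer workflow above looks like on the wire: MCP requests are JSON-RPC 2.0 tools/call messages. The tool name (search) and argument names below are illustrative assumptions rather than the server’s documented schema, so check the README for the real tool list and parameters.

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "search",
    "arguments": {
      "query": "tokio::spawn error handling example",
      "engines": "google,duckduckgo",
      "time_range": "year"
    }
  }
}
```

The response comes back as structured JSON in the shape sketched earlier, so the assistant can rank results by quality score and quote code blocks with their language hints intact.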
A note about an older review
There’s an earlier external review that was written when the project had fewer features. That review remains a helpful snapshot, but the project now includes JSON output, code extraction, semantic research history, quality scoring, and many agent optimizations — so the current release offers substantially more value than that older piece suggested. (See Skywork snapshot.)
How to get involved / next steps
- Try the Docker Compose quickstart and connect it to your assistant (VS Code/Cursor/MCP client).
- If you like Rust, the repo welcomes contributions (features, engines list, site-specific extractors).
- Share feedback, or open an issue with a site that needs better scraping rules — help us improve the content-aware extraction.
TL;DR — Should you try it?
Yes, if you:
- run or develop AI assistants and need high-throughput, agent-friendly web access;
- want to avoid per-query billing and vendor profiling;
- value local control, reproducibility, and structured outputs for agents.
If you need a zero-ops, absolutely-turnkey search product and you don’t care about vendor tracking or costs, a paid API may still be simpler. But for privacy-first, scalable agent workflows, Search-Scrape is convincingly better.