Marc Builds

Posted on May 29

TinySearch: Let Small Local LLMs Search the Web Without Burning Context

#ai #opensource #mcp #rag

I’ve been playing around with local LLM agents a lot lately. Mostly smaller models, MCP tools, Cline/Roo-style workflows, and home lab setups.

Not the “infinite context, infinite budget” world.

More like:

“Can this 4B/9B model actually use the web without getting buried alive by garbage context?”

That was the problem that kept annoying me.

Most web-search tools technically work, but they often dump way too much raw page text into the model. You ask a simple question, and suddenly your local model is trying to reason through cookie banners, broken markdown, SEO filler, navigation menus, duplicated paragraphs, and five pages of irrelevant junk.

For small models, that is painful.

They do not need “the whole web”.

They need a small, useful, source-grounded slice of the web that matches the actual query.

So I built TinySearch.

GitHub: https://github.com/MarcellM01/TinySearch

What TinySearch does

TinySearch is a small open-source MCP research tool that:

searches the web
crawls selected pages
chunks the extracted content
reranks the useful parts
returns a compact, source-grounded prompt for your model

The flow is basically:

search -> crawl -> rerank -> return grounded prompt

That is the whole idea.

TinySearch does not answer the question itself.

It prepares the evidence.

Your actual LLM then answers using that evidence.

That matters because I do not want another LLM layer summarizing summaries. I want the model to receive clean, ranked, URL-attached context and reason from there.
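To make the search -> crawl -> rerank -> prompt flow concrete, here is a minimal sketch of the orchestration. It is not TinySearch's actual code: the `Page` type, the `research` signature, and the stub stages are all hypothetical names chosen for illustration. Each stage is passed in as a plain callable, and the function returns evidence, never an answer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    url: str
    title: str
    text: str


def research(
    query: str,
    search: Callable[[str], List[str]],                # query -> candidate URLs
    crawl: Callable[[str], Page],                      # URL -> extracted page
    rerank: Callable[[str, List[Page]], List[Page]],   # pages -> best-first pages
    max_pages: int = 3,
) -> str:
    """Run search -> crawl -> rerank and return a prompt, not an answer."""
    urls = search(query)[:max_pages]
    pages = [crawl(url) for url in urls]
    ranked = rerank(query, pages)
    evidence = "\n\n".join(f"{p.title}\n{p.url}\n{p.text}" for p in ranked)
    return f"QUESTION\n{query}\n\nRESULTS\n\n{evidence}"


# Stub stages (hypothetical) so the sketch runs offline.
def fake_search(query: str) -> List[str]:
    return ["https://example.com/a", "https://example.com/b"]


def fake_crawl(url: str) -> Page:
    return Page(url=url, title=f"Title {url[-1]}", text=f"Body {url[-1]}")


def by_url_desc(query: str, pages: List[Page]) -> List[Page]:
    return sorted(pages, key=lambda p: p.url, reverse=True)


prompt = research("demo question", fake_search, fake_crawl, by_url_desc)
print(prompt)
```

Keeping the stages as swappable callables is one way to keep the answering model out of the loop: nothing in this function calls an LLM.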

Why I built it

The original pain was simple: I wanted local agents to have web research without insane context overhead.

When a tool dumps entire pages into context, it creates three problems:

it burns tokens for no reason
it confuses smaller models
it makes agent workflows feel heavier than they need to be

TinySearch is for that annoying middle ground where you want web research, but you do not want to set up a whole search stack or pay for a commercial API for every agent query.

It makes sense for:

local LLM agents
MCP workflows
small RAG experiments
personal research tools
coding agents that occasionally need current docs
smaller models that cannot handle context bloat

This is not trying to replace Perplexity, Exa, Tavily, or a production search system.

That would be cope.

It is just a small research layer that gives your model cleaner web context.

You can use it through Glama

One of the easiest ways to try TinySearch is through Glama, which saves you from hosting it yourself.

That is probably the lowest-friction path if you just want to plug it into an MCP-compatible workflow and test whether the tool fits your setup.

But if you prefer running things yourself, there is also a Docker image.

Docker image

The Docker version is the easiest self-hosted setup.

docker run --rm -p 8000:8000 \
  -e MCP_TRANSPORT=streamable-http \
  -e MCP_HOST=0.0.0.0 \
  marcellm01/tinysearch:latest

Then add the server to your MCP client config:

{
  "mcpServers": {
    "tinysearch": {
      "url": "http://localhost:8000/mcp"
    }
  }
}

TinySearch exposes one simple tool:

research(query)

You pass the user’s question as-is, and TinySearch handles the search, crawl, rerank, and prompt-building flow.
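If you want to call the tool from a script rather than an agent, a rough sketch with the official `mcp` Python SDK could look like the following. It assumes the Docker container above is running on `localhost:8000`, that the tool argument is named `query`, and that the result comes back as text content. The helper names `join_text` and `research` are mine, and the SDK's client helper names can differ between versions, so treat this as a starting point.

```python
import asyncio

SERVER_URL = "http://localhost:8000/mcp"  # matches the config above


def join_text(content_items) -> str:
    """Concatenate the text parts of a tool result, skipping non-text items."""
    return "\n".join(
        item.text for item in content_items if getattr(item, "text", None)
    )


async def research(query: str, url: str = SERVER_URL) -> str:
    # Imported lazily so join_text stays usable without the SDK installed.
    from mcp import ClientSession
    from mcp.client.streamable_http import streamablehttp_client

    async with streamablehttp_client(url) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Assumption: the tool takes a single "query" argument.
            result = await session.call_tool("research", {"query": query})
            return join_text(result.content)


# Usage, with the server running:
#   print(asyncio.run(research("What are the latest Basel III updates?")))
```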

There is also an optional FastAPI server if you want to call it over plain HTTP instead of MCP.

What happens under the hood

The current pipeline looks roughly like this:

user question
   ↓
DuckDuckGo HTML search
   ↓
search-result reranking
   ↓
crawl selected pages with Crawl4AI
   ↓
chunk extracted markdown
   ↓
global chunk reranking
   ↓
dedupe + source quotas
   ↓
build source-grounded answer prompt
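The chunk, global rerank, and dedupe + source quota steps can be sketched in plain Python. This is an illustration only, not TinySearch's implementation: it swaps in a toy bag-of-words cosine score where the real tool uses embeddings, and the names `Chunk`, `chunk_markdown`, `select_chunks`, `top_k`, and `per_source` are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    url: str
    text: str


def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Bag-of-words cosine similarity (a stand-in for embedding similarity)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def chunk_markdown(url: str, markdown: str, max_words: int = 80) -> List[Chunk]:
    """Split on blank lines, then cap each paragraph at max_words words."""
    chunks = []
    for para in re.split(r"\n\s*\n", markdown):
        words = para.split()
        for i in range(0, len(words), max_words):
            chunks.append(Chunk(url, " ".join(words[i:i + max_words])))
    return chunks


def select_chunks(
    query: str, chunks: List[Chunk], top_k: int = 4, per_source: int = 2
) -> List[Chunk]:
    """Rank all chunks together, then drop duplicates and cap chunks per URL."""
    q = Counter(tokens(query))
    scored = sorted(
        ((cosine(q, Counter(tokens(c.text))), c) for c in chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )
    seen, used, picked = set(), Counter(), []
    for score, chunk in scored:
        if score <= 0.0 or len(picked) == top_k:
            break  # nothing relevant left, or the budget is full
        key = " ".join(tokens(chunk.text))  # cheap exact-duplicate key
        if key in seen or used[chunk.url] >= per_source:
            continue
        seen.add(key)
        used[chunk.url] += 1
        picked.append(chunk)
    return picked


page_a = (
    "Accept all cookies to continue.\n\n"
    "Basel III capital rules were updated this year.\n\n"
    "Basel III capital rules were updated this year."
)
page_b = (
    "Subscribe to our newsletter.\n\n"
    "The Basel III update changes capital requirements for banks."
)
chunks = chunk_markdown("https://a.example", page_a) + chunk_markdown(
    "https://b.example", page_b
)
picked = select_chunks("Basel III capital update", chunks)
for c in picked:
    print(c.url, "|", c.text)
```

In this toy run the cookie banner and newsletter lines score zero and never reach the model, and the duplicated paragraph is kept only once.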

The final output is not just scraped text.

It is a structured prompt containing:

the original question
today’s date
instructions for the answering model
source titles
URLs
search previews
the most relevant extracted chunks

The goal is to shrink the web into something a small model can actually use.

Example output shape

TinySearch returns something like:

================================================================================
SEARCH-GROUNDED ANSWER PROMPT
================================================================================

QUESTION
What are the latest Basel III updates?

TODAY
2026-05-18

CRITICAL INSTRUCTIONS
Use only the text under RESULTS.
If the answer is not supported, say the results are not enough.
Cite source URLs after factual claims.

RESULTS

RESULT 1
TITLE
...

URL
...

SEARCH PREVIEW
...

RELEVANT TEXT
...

RESULT 2
...

That format is intentional.

I want the final model to know:

what evidence it can use
where the evidence came from
when the search happened
when it should say “not enough information”

This matters a lot for questions involving “latest”, “today”, “this year”, or anything time-sensitive.
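As a rough sketch, a prompt in the shape shown above could be assembled like this. The section labels and instruction text are copied from the example output; the `Result` dataclass and `build_prompt` function are hypothetical names, not TinySearch's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Result:
    title: str
    url: str
    preview: str
    relevant_text: str


RULE = "=" * 80

INSTRUCTIONS = (
    "Use only the text under RESULTS.\n"
    "If the answer is not supported, say the results are not enough.\n"
    "Cite source URLs after factual claims."
)


def build_prompt(
    question: str, results: List[Result], today: Optional[date] = None
) -> str:
    """Assemble the question, date, instructions, and ranked evidence."""
    today = today or date.today()
    parts = [
        RULE, "SEARCH-GROUNDED ANSWER PROMPT", RULE, "",
        "QUESTION", question, "",
        "TODAY", today.isoformat(), "",
        "CRITICAL INSTRUCTIONS", INSTRUCTIONS, "",
        "RESULTS", "",
    ]
    for i, r in enumerate(results, start=1):
        parts += [
            f"RESULT {i}",
            "TITLE", r.title, "",
            "URL", r.url, "",
            "SEARCH PREVIEW", r.preview, "",
            "RELEVANT TEXT", r.relevant_text, "",
        ]
    return "\n".join(parts)


prompt = build_prompt(
    "What are the latest Basel III updates?",
    [Result("Example title", "https://example.com/basel",
            "Example preview", "Example chunk text")],
    today=date(2026, 5, 18),
)
print(prompt)
```

Stamping the date into the prompt is what lets the answering model judge whether a source is recent enough for a "latest" question.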

Embeddings

TinySearch supports local ONNX embeddings or OpenAI-compatible embedding APIs.

The repo includes local presets like:

fast      -> all-MiniLM-L6-v2 ONNX
balanced  -> bge-small-en-v1.5
quality   -> bge-base-en-v1.5

You can start simple, then tune later.

Search depth, rerank weights, chunk limits, crawl concurrency, tokenizer settings, and embedding backend are configurable.

But the default idea stays the same:

return useful research context, not a landfill of raw page text.
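For the OpenAI-compatible backend, the embedding call follows the familiar `/embeddings` request and response shape. Below is a small stdlib-only sketch of that shape, not TinySearch's own client: the base URL, the function names, and the way the model name is passed are assumptions for illustration.

```python
import json
import math
import urllib.request
from typing import List, Sequence


def embedding_request(
    base_url: str, model: str, texts: Sequence[str], api_key: str = ""
) -> urllib.request.Request:
    """Build a POST to {base_url}/embeddings in the OpenAI request shape."""
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/embeddings",
        data=body, headers=headers, method="POST",
    )


def parse_embeddings(payload: dict) -> List[List[float]]:
    """Return vectors ordered by each item's 'index' field."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed(base_url: str, model: str, texts: Sequence[str]) -> List[List[float]]:
    """Send the request; needs a running OpenAI-compatible server."""
    req = embedding_request(base_url, model, texts)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_embeddings(json.load(resp))


# Hypothetical local endpoint; nothing is sent until embed() is called.
req = embedding_request(
    "http://localhost:1234/v1", "bge-small-en-v1.5", ["query", "chunk"]
)
print(req.get_method(), req.full_url)
```

Chunks are then ranked by `cosine` between the query vector and each chunk vector, the same way regardless of which backend produced the vectors.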

What TinySearch is not

TinySearch is not magic.

It does not guarantee perfect search coverage.

It does not build a long-term index.

It does not replace proper production search infrastructure.

And honestly, that is the point.

The goal is not to be everything.

The goal is to be small, inspectable, and useful enough that you can drop it into a local agent workflow and immediately get better web research without context-window abuse.

Why I care about this

I think a lot of local agent work is blocked less by model intelligence and more by the surrounding harness.

People focus a lot on the model, but then the model gets wrapped in tools that behave as if context is free and every model has infinite attention.

That works badly for small models.

But even large models benefit from cleaner inputs.

LLMs need less junk.

Context is not a trash can.

That is the actual problem TinySearch is trying to solve.

Not “web search for AI” in some huge abstract way.

More like:

Can we give a local model just enough high-quality web context to answer properly, without burying it alive?

That is the game.

Try it

GitHub: https://github.com/MarcellM01/TinySearch

Docker:

docker run --rm -p 8000:8000 \
  -e MCP_TRANSPORT=streamable-http \
  -e MCP_HOST=0.0.0.0 \
  marcellm01/tinysearch:latest

Or use the Glama option if you do not want to host it yourself.

Feedback is very welcome, especially from people building with:

local LLMs
MCP
Cline / Roo-style coding agents
RAG systems
small model workflows
personal research agents

And yeah, roast it too.

That is usually where the useful feedback is anyway.


Top comments (1)

Harjot Singh • May 31

This is the right problem to obsess over. Context is the scarcest resource for a 4B/9B model, and most web-search tools treat the model like it has infinite room and judgment. Dumping raw page text (cookie banners, nav, SEO filler, dupes) doesn't just waste tokens; it actively degrades reasoning, because a small model can't tell the signal from the junk and will happily anchor on a navigation menu. The insight that the retrieval layer should do the filtering, not the model, is the reverse of how most tools are built, and it's why small-model agents feel dumb when they're really just drowning. I live this daily: I route a lot of work through a context-firewall pattern where only a distilled summary ever reaches the model and the raw payload stays out of the window, same spirit as TinySearch. Cost-wise it's the whole game: clean context is what lets a cheap local model do work people assume needs a frontier one. How are you extracting the main content: a readability pass, or an embedding rank of chunks against the query before anything hits the model?