DEV Community

Marc Builds

TinySearch: Let Small Local LLMs Search the Web Without Burning Context

I’ve been playing around with local LLM agents a lot lately: mostly smaller models, MCP tools, Cline/Roo-style workflows, and home lab setups.

Not the “infinite context, infinite budget” world.

More like:

“Can this 4B/9B model actually use the web without getting buried alive by garbage context?”

That was the problem that kept annoying me.

Most web-search tools technically work, but they often dump way too much raw page text into the model. You ask a simple question, and suddenly your local model is trying to reason through cookie banners, broken markdown, SEO filler, navigation menus, duplicated paragraphs, and five pages of irrelevant junk.

For small models, that is painful.

They do not need “the whole web”.

They need a small, useful, source-grounded slice of the web that matches the actual query.

So I built TinySearch.

GitHub: https://github.com/MarcellM01/TinySearch


Gif: What a practical search looks like, from crawling to the returned prompt

What TinySearch does

TinySearch is a small open-source MCP research tool that:

  1. searches the web
  2. crawls selected pages
  3. chunks the extracted content
  4. reranks the useful parts
  5. returns a compact, source-grounded prompt for your model

The flow is basically:

search -> crawl -> chunk -> rerank -> return grounded prompt

That is the whole idea.

TinySearch does not answer the question itself.

It prepares the evidence.

Your actual LLM then answers using that evidence.

That matters because I do not want another LLM layer summarizing summaries. I want the model to receive clean, ranked, URL-attached context and reason from there.


Why I built it

The original pain was simple: I wanted local agents to have web research without insane context overhead.

When a tool dumps entire pages into context, it creates three problems:

  • it burns tokens for no reason
  • it confuses smaller models
  • it makes agent workflows feel heavier than they need to be

TinySearch is for that annoying middle ground where you want web research, but you do not want to set up a whole search stack or pay for a commercial API for every agent query.

It makes sense for:

  • local LLM agents
  • MCP workflows
  • small RAG experiments
  • personal research tools
  • coding agents that occasionally need current docs
  • smaller models that cannot handle context bloat

This is not trying to replace Perplexity, Exa, Tavily, or a production search system.

That would be cope.

It is just a small research layer that gives your model cleaner web context.


You can use it through Glama

One of the easiest ways to try TinySearch is through Glama.

If you do not want to host it yourself, use the Glama option instead.

That is probably the lowest-friction path if you just want to plug it into an MCP-compatible workflow and test whether the tool fits your setup.

But if you prefer running things yourself, there is also a Docker image.


Docker image

The Docker version is the easiest self-hosted setup.

docker run --rm -p 8000:8000 \
  -e MCP_TRANSPORT=streamable-http \
  -e MCP_HOST=0.0.0.0 \
  marcellm01/tinysearch:latest

Then connect your MCP client to:

{
  "mcpServers": {
    "tinysearch": {
      "url": "http://localhost:8000/mcp"
    }
  }
}

TinySearch exposes one simple tool:

research(query)

You pass the user’s question as-is, and TinySearch handles the search, crawl, rerank, and prompt-building flow.
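On the wire, an MCP client invokes a tool by sending a JSON-RPC `tools/call` request. Here is a minimal sketch of what that message might look like for `research`. The argument name `query` is an assumption taken from the `research(query)` signature above, and `build_research_call` is a hypothetical helper, not part of TinySearch.

```python
import json


def build_research_call(query: str, request_id: int = 1) -> dict:
    """Build an MCP `tools/call` JSON-RPC request for the `research` tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "research",
            # Assumption: the tool's single argument is named `query`.
            "arguments": {"query": query},
        },
    }


if __name__ == "__main__":
    request = build_research_call("What are the latest Basel III updates?")
    print(json.dumps(request, indent=2))
```

In practice you rarely build this by hand. An MCP client library performs the initialize handshake and sends the call for you, but seeing the payload makes it clear how little the caller has to provide.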

There is also an optional FastAPI server if you want to use it over HTTP instead of MCP.


What happens under the hood

The current pipeline looks roughly like this:

user question
   ↓
DuckDuckGo HTML search
   ↓
search-result reranking
   ↓
crawl selected pages with Crawl4AI
   ↓
chunk extracted markdown
   ↓
global chunk reranking
   ↓
dedupe + source quotas
   ↓
build source-grounded answer prompt
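To make the chunking and "dedupe + source quotas" steps concrete, here is a rough sketch of how they could work. This is not TinySearch's actual code: `Chunk`, `chunk_text`, `select_chunks`, and the default limits are all hypothetical, and the relevance scores are assumed to come from an earlier reranking step.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    url: str
    text: str
    score: float  # relevance score from a reranker (higher is better)


def chunk_text(text: str, max_words: int = 120) -> list[str]:
    """Group paragraphs into chunks of at most `max_words` words."""
    chunks: list[str] = []
    current: list[str] = []
    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue
        # Start a new chunk if this paragraph would overflow the current one.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
        # Split paragraphs that are longer than the limit on their own.
        while len(current) > max_words:
            chunks.append(" ".join(current[:max_words]))
            current = current[max_words:]
    if current:
        chunks.append(" ".join(current))
    return chunks


def select_chunks(chunks: list[Chunk], top_k: int = 5, per_source: int = 2) -> list[Chunk]:
    """Keep the best-scoring chunks, dropping duplicates and capping each URL."""
    seen_texts: set[str] = set()
    per_url: dict[str, int] = {}
    selected: list[Chunk] = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        key = " ".join(chunk.text.lower().split())  # normalise case and whitespace
        if key in seen_texts:
            continue  # exact duplicate of something already kept
        if per_url.get(chunk.url, 0) >= per_source:
            continue  # this source already used its quota
        seen_texts.add(key)
        per_url[chunk.url] = per_url.get(chunk.url, 0) + 1
        selected.append(chunk)
        if len(selected) == top_k:
            break
    return selected
```

The quota is what stops one long page from crowding out every other source, which matters when the final prompt only has room for a handful of chunks.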

The final output is not just scraped text.

It is a structured prompt containing:

  • the original question
  • today’s date
  • instructions for the answering model
  • source titles
  • URLs
  • search previews
  • the most relevant extracted chunks

The goal is to shrink the web into something a small model can actually use.


Example output shape

TinySearch returns something like:

================================================================================
SEARCH-GROUNDED ANSWER PROMPT
================================================================================

QUESTION
What are the latest Basel III updates?

TODAY
2026-05-18

CRITICAL INSTRUCTIONS
Use only the text under RESULTS.
If the answer is not supported, say the results are not enough.
Cite source URLs after factual claims.

RESULTS

RESULT 1
TITLE
...

URL
...

SEARCH PREVIEW
...

RELEVANT TEXT
...

RESULT 2
...

That format is intentional.

I want the final model to know:

  • what evidence it can use
  • where the evidence came from
  • when the search happened
  • when it should say “not enough information”

This matters a lot for questions involving “latest”, “today”, “this year”, or anything time-sensitive.
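If you want to experiment with this layout in your own tooling, a prompt in the shape shown above can be assembled with a few lines of Python. This is an illustrative sketch only: `Result` and `build_prompt` are hypothetical names, and the real implementation may differ in its details.

```python
from dataclasses import dataclass
from datetime import date

RULE = "=" * 80


@dataclass
class Result:
    title: str
    url: str
    preview: str  # snippet shown by the search engine
    text: str     # top-ranked chunk(s) extracted from the page


def build_prompt(question: str, results: list[Result], today: date) -> str:
    """Assemble a source-grounded prompt in the layout shown above."""
    lines = [
        RULE, "SEARCH-GROUNDED ANSWER PROMPT", RULE, "",
        "QUESTION", question, "",
        "TODAY", today.isoformat(), "",
        "CRITICAL INSTRUCTIONS",
        "Use only the text under RESULTS.",
        "If the answer is not supported, say the results are not enough.",
        "Cite source URLs after factual claims.", "",
        "RESULTS",
    ]
    for index, result in enumerate(results, start=1):
        lines += [
            "", f"RESULT {index}",
            "TITLE", result.title, "",
            "URL", result.url, "",
            "SEARCH PREVIEW", result.preview, "",
            "RELEVANT TEXT", result.text,
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [Result("Example title", "https://example.com", "preview...", "chunk text...")]
    print(build_prompt("What are the latest Basel III updates?", demo, date.today()))
```

Passing the date in explicitly, rather than reading it inside the function, keeps the output reproducible and makes it obvious which "today" the model was told about.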


Embeddings

TinySearch supports local ONNX embeddings or OpenAI-compatible embedding APIs.

The repo includes local presets like:

fast      -> all-MiniLM-L6-v2 ONNX
balanced  -> bge-small-en-v1.5
quality   -> bge-base-en-v1.5

You can start simple, then tune later.

Search depth, rerank weights, chunk limits, crawl concurrency, tokenizer settings, and embedding backend are configurable.

But the default idea stays the same:

return useful research context, not a landfill of raw page text.
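Whichever embedding backend produces the vectors, embedding-based reranking generally comes down to comparing the query vector with each chunk vector. Below is a minimal sketch using cosine similarity, with tiny hand-written vectors standing in for real model output. The `cosine` and `rerank` helpers are hypothetical. Since rerank weights are configurable, TinySearch's actual scoring presumably blends more than one signal, so treat this as the similarity part only.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(
    query_vec: list[float],
    chunk_vecs: dict[str, list[float]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the `top_k` chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
    query = [1.0, 0.0]
    chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    for chunk_id, score in rerank(query, chunks, top_k=2):
        print(chunk_id, round(score, 3))
```

Swapping presets changes the vectors that go in, not this comparison step, which is why you can start with the fast model and move up later without touching the rest of the pipeline.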


What TinySearch is not

TinySearch is not magic.

It does not guarantee perfect search coverage.

It does not build a long-term index.

It does not replace proper production search infrastructure.

And honestly, that is the point.

The goal is not to be everything.

The goal is to be small, inspectable, and useful enough that you can drop it into a local agent workflow and immediately get better web research without context-window abuse.


Why I care about this

I think a lot of local agent work is blocked less by model intelligence and more by the surrounding harness.

People focus a lot on the model, but then the model gets wrapped in tools that behave as if context is free and every model has infinite attention.

That works badly for small models.

But even large models benefit from cleaner inputs.

LLMs need less junk.

Context is not a trash can.

That is the actual problem TinySearch is trying to solve.

Not “web search for AI” in some huge abstract way.

More like:

Can we give a local model just enough high-quality web context to answer properly, without burying it alive?

That is the game.


Try it

GitHub: https://github.com/MarcellM01/TinySearch

Docker:

docker run --rm -p 8000:8000 \
  -e MCP_TRANSPORT=streamable-http \
  -e MCP_HOST=0.0.0.0 \
  marcellm01/tinysearch:latest

Or use the Glama option if you do not want to host it yourself.

Feedback is very welcome, especially from people building with:

  • local LLMs
  • MCP
  • Cline / Roo-style coding agents
  • RAG systems
  • small model workflows
  • personal research agents

And yeah, roast it too.

That is usually where the useful feedback is anyway.


