As developers, we love giving our AI agents tools. "Here's a search tool! Here's a calculator! Here's the entire internet!" But with great power comes great... API bills.
One of the most annoying behaviors of naive RAG (Retrieval Augmented Generation) agents is redundancy. They love to search for things they already know.
In this post, I'll walk you through how I built a "Frugal Researcher" using Agno (formerly Phidata), Ollama (running Llama 3.2 locally), and Tavily. The goal? To stop the agent from wasting money on questions it should already be able to answer.
The Problem: The "Why?" Loop
Imagine this conversation flow:
- User: "Who is the CEO of DeepMind?"
- Agent: Calls Search Tool -> "Demis Hassabis" -> "The CEO is Demis Hassabis."
- User: "Why is he famous?"
- Agent: Calls Search Tool -> "Demis Hassabis famous for..." -> "He founded DeepMind..."
In the first exchange, the agent likely retrieved a biography that already explained why he is famous. Yet on the follow-up, it searched again anyway.
This doubles our latency and our API costs. For a production app at scale, this is a disaster.
The Solution: Logic-Gated Tool Access
The fix isn't just better prompting (though that helps). The fix is Constraint.
If we know the user is asking a follow-up "reasoning" question, we should forbid the agent from answering with anything other than its memory.
The Stack
- Agno: A lightweight framework for building agents with memory and tools.
- Ollama: Running llama3.2 (3B parameters), a small but capable model.
- Tavily: A search API optimized for LLMs (returns JSON, not messy HTML).
- SQLite: Local database to store conversation history.
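Wiring these together in Agno looks roughly like this. Treat it as a minimal sketch: the import paths and the SQLite storage class name differ between Agno versions, so check the docs for the release you're using.

```python
from agno.agent import Agent
from agno.models.ollama import Ollama
from agno.tools.tavily import TavilyTools
from agno.storage.sqlite import SqliteStorage  # exact module/class name varies by Agno version

agent = Agent(
    model=Ollama(id="llama3.2"),            # local 3B model served by Ollama
    tools=[TavilyTools()],                  # expects TAVILY_API_KEY in the environment
    storage=SqliteStorage(table_name="frugal_researcher", db_file="agent.db"),
    add_history_to_context=True,            # inject prior turns so follow-ups have context
    markdown=True,
)
```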
The Implementation
We wrap our agent interaction in a simple CLI loop (demo.py). Here is the magic logic:
```python
# Check if the user is asking "why" to prevent redundant searches
original_tools = None
if "why" in user_input.lower() or "reason" in user_input.lower():
    # 🛑 STOP! Disable tools for this turn.
    original_tools = agent.tools
    agent.tools = []

# Run the agent (forcing it to use conversation context)
agent.print_response(user_input, stream=True)

# Restore tools for the next turn
if original_tools is not None:
    agent.tools = original_tools
```
By temporarily setting agent.tools = [], we force the Agno agent to bypass its "Tool Selection" phase. It looks at the prompt, sees the conversation history (injected via add_history_to_context=True), and realizes: "I have no tools. I must answer from what I know."
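If you want that intent check to be easier to test (and a natural place to add more cue words later), it can live in a tiny helper. The names below are hypothetical, not from the repo; the keywords mirror the check above:

```python
# Hypothetical helper: same "why"/"reason" check as in the loop above.
REASONING_CUES = ("why", "reason")

def is_reasoning_followup(text: str) -> bool:
    """Return True when the turn looks like a follow-up the agent should answer from memory."""
    lowered = text.lower()
    return any(cue in lowered for cue in REASONING_CUES)

# Matches the example conversation from earlier:
assert is_reasoning_followup("Why is he famous?")
assert not is_reasoning_followup("Who is the CEO of DeepMind?")
```

The gate in the CLI loop then simply becomes `if is_reasoning_followup(user_input):`.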
Why Llama 3.2?
I chose the 3B parameter model because it's fast and runs easily on a standard laptop. However, smaller models can be stubborn. They might hallucinate a tool call even if they don't have tools!
To fix this, I added explicit instructions and retries=2 to the Agent configuration:
```python
instructions=[
    "If the user asks 'why'... explain your reasoning based on the search results you just found.",
    "Do NOT trigger a new search...",
    "ENSURE that your tool calls are valid JSON.",
],
retries=2,
```
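Putting the pieces together (continuing the sketch from the stack section, with the same caveat that exact kwargs and import paths depend on your Agno version), the constructor ends up looking roughly like this:

```python
agent = Agent(
    model=Ollama(id="llama3.2"),
    tools=[TavilyTools()],
    storage=SqliteStorage(table_name="frugal_researcher", db_file="agent.db"),
    add_history_to_context=True,
    instructions=[
        "If the user asks 'why'... explain your reasoning based on the search results you just found.",
        "Do NOT trigger a new search...",
        "ENSURE that your tool calls are valid JSON.",
    ],
    retries=2,  # retry the turn if the small model produces an invalid tool call
)
```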
The Results
The difference is night and day.
- Before: Every turn had a 2-3 second "Thinking..." pause while it queried Tavily.
- After: The "Why" response is almost instantaneous. The agent effectively answers, "Oh, based on that article I just read..." straight from the context already in its history.
Conclusion
Building agents isn't just about connecting models to tools. It's about orchestration. By guiding the agent's capabilities based on user intent, we can build systems that are faster, cheaper, and smarter.
Check out the full source code on GitHub: https://github.com/harishkotra/ollama-tavily