Ollama's Python library now includes native web search capabilities. With just a few lines of code, you can augment your local LLMs with real-time information from the web, reducing hallucinations and improving accuracy.
Getting Started
How do I install the Ollama Python library for web search? Install version 0.6.0 or higher using pip install 'ollama>=0.6.0'. This version includes the web_search and web_fetch functions.
pip install 'ollama>=0.6.0'
For managing Python environments and packages, consider using uv, a fast Python package manager, or set up a virtual environment using venv to keep your dependencies isolated.
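For example, here is one possible setup with either tool (the uv commands assume uv is already installed; paths shown for macOS/Linux):

# Option 1: uv (creates a virtual environment and installs in one step)
uv venv
uv pip install 'ollama>=0.6.0'

# Option 2: standard library venv
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install 'ollama>=0.6.0'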
Create an API key from your Ollama account and set it as an environment variable:
export OLLAMA_API_KEY="your_api_key"
On Windows PowerShell:
$env:OLLAMA_API_KEY = "your_api_key"
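Before making any calls, it can help to verify the key is actually visible to your Python process. This quick check uses only the standard library and is not part of the Ollama API:

import os

# web_search and web_fetch authenticate via the OLLAMA_API_KEY environment
# variable, so fail fast with a clear message if it isn't set
if not os.environ.get("OLLAMA_API_KEY"):
    raise RuntimeError("OLLAMA_API_KEY is not set")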
Basic Web Search
The simplest way to search the web with Ollama:
import ollama
# Simple web search
response = ollama.web_search("What is Ollama?")
print(response)
Output:
results = [
  {
    "title": "Ollama",
    "url": "https://ollama.com/",
    "content": "Cloud models are now available in Ollama..."
  },
  {
    "title": "What is Ollama? Features, Pricing, and Use Cases",
    "url": "https://www.walturn.com/insights/what-is-ollama",
    "content": "Our services..."
  },
  {
    "title": "Complete Ollama Guide: Installation, Usage & Code Examples",
    "url": "https://collabnix.com/complete-ollama-guide",
    "content": "Join our Discord Server..."
  }
]
Controlling Result Count
import ollama

# Get more results
response = ollama.web_search("latest AI news", max_results=10)

for result in response.results:
    print(f"📌 {result.title}")
    print(f"   {result.url}")
    print(f"   {result.content[:100]}...")
    print()
Fetching Full Page Content
What is web_search vs web_fetch in Ollama Python? web_search queries the internet and returns multiple search results with titles, URLs, and snippets. web_fetch retrieves the full content of a specific URL, returning the page title, markdown content, and links. The markdown content returned by web_fetch is perfect for further processing—if you need to convert HTML to markdown in other contexts, see our guide on converting HTML to Markdown with Python.
from ollama import web_fetch
result = web_fetch('https://ollama.com')
print(result)
Output:
WebFetchResponse(
    title='Ollama',
    content='[Cloud models](https://ollama.com/blog/cloud-models) are now available in Ollama\n\n**Chat & build with open models**\n\n[Download](https://ollama.com/download) [Explore models](https://ollama.com/models)\n\nAvailable for macOS, Windows, and Linux',
    links=['https://ollama.com/', 'https://ollama.com/models', 'https://github.com/ollama/ollama']
)
Combining Search and Fetch
A common pattern is to search first, then fetch full content from relevant results:
from ollama import web_search, web_fetch

# Search for information
search_results = web_search("Ollama new features 2025")

# Fetch full content from the first result
if search_results.results:
    first_url = search_results.results[0].url
    full_content = web_fetch(first_url)
    print(f"Title: {full_content.title}")
    print(f"Content: {full_content.content[:500]}...")
    print(f"Links found: {len(full_content.links)}")
Building a Search Agent
Which Python models work best for Ollama search agents? Models with strong tool-use capabilities work best, including qwen3, gpt-oss, and cloud models like qwen3:480b-cloud and deepseek-v3.1-cloud. For more advanced use cases requiring structured outputs from these models, check out our guide on LLMs with Structured Output using Ollama and Qwen3.
First, pull a capable model:
ollama pull qwen3:4b
Simple Search Agent
Here's a basic search agent that can autonomously decide when to search:
from ollama import chat, web_fetch, web_search

available_tools = {'web_search': web_search, 'web_fetch': web_fetch}

messages = [{'role': 'user', 'content': "what is ollama's new engine"}]

while True:
    response = chat(
        model='qwen3:4b',
        messages=messages,
        tools=[web_search, web_fetch],
        think=True
    )
    if response.message.thinking:
        print('🧠 Thinking:', response.message.thinking[:200], '...')
    if response.message.content:
        print('💬 Response:', response.message.content)
    messages.append(response.message)
    if response.message.tool_calls:
        print('🔧 Tool calls:', response.message.tool_calls)
        for tool_call in response.message.tool_calls:
            function_to_call = available_tools.get(tool_call.function.name)
            if function_to_call:
                args = tool_call.function.arguments
                result = function_to_call(**args)
                print('📥 Result:', str(result)[:200], '...')
                # Truncate result to fit limited context lengths
                messages.append({
                    'role': 'tool',
                    'content': str(result)[:2000 * 4],
                    'tool_name': tool_call.function.name
                })
            else:
                messages.append({
                    'role': 'tool',
                    'content': f'Tool {tool_call.function.name} not found',
                    'tool_name': tool_call.function.name
                })
    else:
        break
How do I handle large web search results in Python? Truncate results to fit context limits. The recommended approach is to slice the result string to approximately 8000 characters (2000 tokens × 4 chars) before passing to the model.
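As a minimal sketch, that advice reduces to a small helper (the helper name and the exact 8000-character cap are illustrative, not part of the library):

def truncate_for_context(result, max_chars: int = 8000) -> str:
    """Clip a tool result to roughly 2000 tokens, assuming ~4 chars per token."""
    text = str(result)
    return text if len(text) <= max_chars else text[:max_chars] + "... [truncated]"

Pass the clipped string as the tool message content, e.g. truncate_for_context(result) instead of str(result).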
Advanced Search Agent with Error Handling
Here's an enhanced version with better error handling:
from ollama import chat, web_fetch, web_search


class SearchAgent:
    def __init__(self, model: str = 'qwen3:4b'):
        self.model = model
        self.tools = {'web_search': web_search, 'web_fetch': web_fetch}
        self.messages = []
        self.max_iterations = 10

    def query(self, question: str) -> str:
        self.messages = [{'role': 'user', 'content': question}]
        for iteration in range(self.max_iterations):
            try:
                response = chat(
                    model=self.model,
                    messages=self.messages,
                    tools=[web_search, web_fetch],
                    think=True
                )
            except Exception as e:
                return f"Error during chat: {e}"
            self.messages.append(response.message)
            # If no tool calls, we have a final answer
            if not response.message.tool_calls:
                return response.message.content or "No response generated"
            # Execute tool calls
            for tool_call in response.message.tool_calls:
                result = self._execute_tool(tool_call)
                self.messages.append({
                    'role': 'tool',
                    'content': result,
                    'tool_name': tool_call.function.name
                })
        return "Max iterations reached without final answer"

    def _execute_tool(self, tool_call) -> str:
        func_name = tool_call.function.name
        args = tool_call.function.arguments
        if func_name not in self.tools:
            return f"Unknown tool: {func_name}"
        try:
            result = self.tools[func_name](**args)
            # Truncate for context limits
            result_str = str(result)
            if len(result_str) > 8000:
                result_str = result_str[:8000] + "... [truncated]"
            return result_str
        except Exception as e:
            return f"Tool error: {e}"


# Usage
agent = SearchAgent(model='qwen3:4b')
answer = agent.query("What are the latest features in Ollama?")
print(answer)
Async Web Search
Can I use Ollama Python web search with async code? Yes, the Ollama Python library supports async operations. Use AsyncClient for non-blocking web search and fetch operations in async applications. For performance comparisons between Python and other languages in serverless contexts, see our analysis of AWS Lambda performance across JavaScript, Python, and Golang.
import asyncio
from ollama import AsyncClient


async def async_search():
    client = AsyncClient()
    # Perform multiple searches concurrently
    tasks = [
        client.web_search("Ollama features"),
        client.web_search("local LLM tools"),
        client.web_search("AI search agents"),
    ]
    results = await asyncio.gather(*tasks)
    for i, result in enumerate(results):
        print(f"Search {i + 1}:")
        for r in result.results[:2]:
            print(f"  - {r.title}")
        print()


# Run async search
asyncio.run(async_search())
Async Search Agent
import asyncio
from ollama import AsyncClient


async def async_research_agent(question: str):
    client = AsyncClient()
    messages = [{'role': 'user', 'content': question}]
    while True:
        response = await client.chat(
            model='qwen3:4b',
            messages=messages,
            tools=[client.web_search, client.web_fetch],
        )
        messages.append(response.message)
        if not response.message.tool_calls:
            return response.message.content
        # Build one coroutine per tool call
        tool_names = []
        tool_coros = []
        for tool_call in response.message.tool_calls:
            if tool_call.function.name == 'web_search':
                tool_coros.append(client.web_search(**tool_call.function.arguments))
            elif tool_call.function.name == 'web_fetch':
                tool_coros.append(client.web_fetch(**tool_call.function.arguments))
            else:
                continue
            tool_names.append(tool_call.function.name)
        # Execute tool calls concurrently
        results = await asyncio.gather(*tool_coros)
        for tool_name, result in zip(tool_names, results):
            messages.append({
                'role': 'tool',
                'content': str(result)[:8000],  # truncate for context limits
                'tool_name': tool_name
            })


# Run
answer = asyncio.run(async_research_agent("What's new in Python 3.13?"))
print(answer)
Context Length and Performance
What context length should I set for Python search agents? Set context length to approximately 32000 tokens for reasonable performance. Search agents work best with full context length since web_search and web_fetch can return thousands of tokens.
from ollama import chat, web_search

# Set higher context for search-heavy tasks
response = chat(
    model='qwen3:4b',
    messages=[{'role': 'user', 'content': 'Research the latest AI developments'}],
    tools=[web_search],
    options={
        'num_ctx': 32768,  # 32K context
    }
)
MCP Server Integration
Ollama provides a Python MCP server that enables web search in any MCP client. For a comprehensive guide on building MCP servers in Python with web search and scraping capabilities, see our detailed tutorial on Building MCP Servers in Python.
Cline Integration
Configure MCP servers in Cline settings:
Manage MCP Servers → Configure MCP Servers → Add:
{
  "mcpServers": {
    "web_search_and_fetch": {
      "type": "stdio",
      "command": "uv",
      "args": ["run", "path/to/web-search-mcp.py"],
      "env": { "OLLAMA_API_KEY": "your_api_key_here" }
    }
  }
}
Codex Integration
Add to ~/.codex/config.toml:
[mcp_servers.web_search]
command = "uv"
args = ["run", "path/to/web-search-mcp.py"]
env = { "OLLAMA_API_KEY" = "your_api_key_here" }
Creating Your Own MCP Server
#!/usr/bin/env python3
"""Simple MCP server for Ollama web search."""
# Uses FastMCP from the official MCP Python SDK ('mcp' package)
from mcp.server.fastmcp import FastMCP

from ollama import web_fetch, web_search

mcp = FastMCP("ollama-web-search")


@mcp.tool()
def search_web(query: str, max_results: int = 5) -> str:
    """Search the web for information."""
    results = web_search(query, max_results=max_results)
    output = []
    for r in results.results:
        output.append(f"**{r.title}**\n{r.url}\n{r.content}\n")
    return "\n---\n".join(output)


@mcp.tool()
def fetch_page(url: str) -> str:
    """Fetch the full content of a web page."""
    result = web_fetch(url)
    return f"# {result.title}\n\n{result.content}"


if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport
Practical Examples
These examples demonstrate real-world applications of Ollama's web search API. You can extend these patterns to build more complex systems—for instance, combining search results with PDF generation in Python to create research reports.
News Summarizer
from ollama import chat, web_search


def summarize_news(topic: str) -> str:
    # Search for recent news
    results = web_search(f"{topic} latest news", max_results=5)
    # Format search results for the model
    news_content = "\n\n".join([
        f"**{r.title}**\n{r.content}"
        for r in results.results
    ])
    # Ask the model to summarize
    response = chat(
        model='qwen3:4b',
        messages=[{
            'role': 'user',
            'content': f"Summarize these news items about {topic}:\n\n{news_content}"
        }]
    )
    return response.message.content


summary = summarize_news("artificial intelligence")
print(summary)
Research Assistant
from dataclasses import dataclass

from ollama import chat, web_fetch, web_search


@dataclass
class ResearchResult:
    question: str
    sources: list
    answer: str


def research(question: str) -> ResearchResult:
    # Search for relevant information
    search_results = web_search(question, max_results=3)
    # Fetch full content from top sources
    sources = []
    full_content = []
    for result in search_results.results[:3]:
        try:
            page = web_fetch(result.url)
            sources.append(result.url)
            full_content.append(f"Source: {result.url}\n{page.content[:2000]}")
        except Exception:
            # Skip pages that fail to fetch
            continue
    # Generate comprehensive answer
    context = "\n\n---\n\n".join(full_content)
    response = chat(
        model='qwen3:4b',
        messages=[{
            'role': 'user',
            'content': f"""Based on the following sources, answer this question: {question}

Sources:
{context}

Provide a comprehensive answer with citations to the sources."""
        }]
    )
    return ResearchResult(
        question=question,
        sources=sources,
        answer=response.message.content
    )


# Usage
result = research("How does Ollama's new model scheduling work?")
print(f"Question: {result.question}")
print(f"Sources: {result.sources}")
print(f"Answer: {result.answer}")
Recommended Models
| Model | Parameters | Best For |
|---|---|---|
| qwen3:4b | 4B | Quick local searches |
| qwen3 | 8B | General purpose agent |
| gpt-oss | Various | Research tasks |
| qwen3:480b-cloud | 480B | Complex reasoning (cloud) |
| gpt-oss:120b-cloud | 120B | Long-form research (cloud) |
| deepseek-v3.1-cloud | - | Advanced analysis (cloud) |
Best Practices
- Truncate Results: Always truncate web results to fit context limits (~8000 chars)
- Error Handling: Wrap tool calls in try/except for network failures (see the sketch after this list)
- Rate Limiting: Respect Ollama's API rate limits for web search
- Context Length: Use ~32000 tokens for search agents
- Async for Scale: Use AsyncClient for concurrent operations
- Testing: Write unit tests for your search agents to ensure reliability
- Python Basics: Keep a Python cheatsheet handy for quick reference on syntax and common patterns
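To tie the first three practices together, here is a minimal sketch of a search wrapper with truncation, error handling, and a crude backoff (the function name, retry count, and backoff schedule are illustrative, not part of the Ollama API):

import time

from ollama import web_search


def safe_search(query: str, retries: int = 2, max_chars: int = 8000) -> str:
    """Run web_search with basic retry, exponential backoff, and truncation."""
    for attempt in range(retries + 1):
        try:
            result = web_search(query)
            return str(result)[:max_chars]  # keep within context limits
        except Exception as e:
            if attempt == retries:
                return f"Search failed after {retries + 1} attempts: {e}"
            time.sleep(2 ** attempt)  # back off to stay under rate limits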
Useful Links
- Ollama Web Search Blog Post
- Ollama Python Library
- Ollama Official Documentation
- Ollama GitHub Repository
- Python Cheatsheet
- Converting HTML to Markdown with Python: A Comprehensive Guide
- Building MCP Servers in Python: WebSearch & Scrape
- LLMs with Structured Output: Ollama, Qwen3 & Python or Go
- Unit Testing in Python
- AWS lambda performance: JavaScript vs Python vs Golang
- venv Cheatsheet
- uv - Python Package, Project, and Environment Manager
- Generating PDF in Python - Libraries and examples