DEV Community

Ferit

Building Multi-Model AI Agents with OpenAI, Ollama, Groq and Gemini

Most AI applications today rely on a single LLM provider. That works fine until the API goes down, rate limits hit, or your costs spiral out of control. A better approach is to build agents that can orchestrate multiple models and switch between them based on the task at hand.

In this article, I will walk through how I built an AI agent framework that supports OpenAI's GPT-4, local models via Ollama, Groq's ultra-fast inference, and Google Gemini as interchangeable backends.

Why Multi-Model?

Each provider has different strengths:

  • OpenAI GPT-4 offers strong reasoning and mature function calling
  • Ollama runs models locally, so there is no network latency and no per-token API cost
  • Groq delivers sub-200ms inference for real-time applications
  • Gemini excels at multimodal tasks (vision, audio, code)

By abstracting the provider layer, your agent can pick the right model for each subtask, fall back gracefully when one provider fails, and optimize cost by routing simple tasks to cheaper models.

Architecture Overview

The framework has four main components:

```
Agent Core -> Planning -> Tool Execution -> Memory
     |            |            |              |
  LLM Router   Task Graph   Registry     Redis/PostgreSQL
     |
  OpenAI | Ollama | Groq | Gemini
```

The LLM Router is the key piece. It decides which provider handles each request based on configurable rules.

Setting Up the Provider Layer

First, define a common interface that all providers implement:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class LLMResponse:
    content: str
    model: str
    provider: str
    tokens_used: int
    latency_ms: float
    tool_calls: list = field(default_factory=list)

    @property
    def has_tool_calls(self) -> bool:
        return bool(self.tool_calls)

class LLMProvider(ABC):
    @abstractmethod
    async def complete(self, messages: list, tools: list | None = None) -> LLMResponse:
        ...

    async def embed(self, text: str) -> list[float]:
        # Not every backend offers embeddings, so this is optional to
        # override; keeping it abstract would make providers that skip
        # it uninstantiable
        raise NotImplementedError
```

Then implement each provider:

```python
import os
import time

import openai
import ollama
from groq import AsyncGroq
import google.generativeai as genai

class OpenAIProvider(LLMProvider):
    def __init__(self, model="gpt-4"):
        self.client = openai.AsyncOpenAI()
        self.model = model

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        kwargs = {"model": self.model, "messages": messages}
        if tools:  # only pass tools when set; an explicit null is rejected
            kwargs["tools"] = tools
        response = await self.client.chat.completions.create(**kwargs)
        latency = (time.monotonic() - start) * 1000
        message = response.choices[0].message
        return LLMResponse(
            content=message.content or "",
            model=self.model,
            provider="openai",
            tokens_used=response.usage.total_tokens,
            latency_ms=latency,
            tool_calls=message.tool_calls or []
        )

class OllamaProvider(LLMProvider):
    def __init__(self, model="llama3"):
        self.model = model

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        response = await ollama.AsyncClient().chat(
            model=self.model,
            messages=messages
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response["message"]["content"],
            model=self.model,
            provider="ollama",
            tokens_used=response.get("eval_count", 0),
            latency_ms=latency
        )

class GroqProvider(LLMProvider):
    def __init__(self, model="llama3-70b-8192"):
        # AsyncGroq, not the sync Groq client, which would block the event loop
        self.client = AsyncGroq()
        self.model = model

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response.choices[0].message.content,
            model=self.model,
            provider="groq",
            tokens_used=response.usage.total_tokens,
            latency_ms=latency
        )

class GeminiProvider(LLMProvider):
    def __init__(self, model="gemini-pro"):
        genai.configure(api_key=os.environ["GEMINI_API_KEY"])
        self.model_name = model
        self.model = genai.GenerativeModel(model)

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        # Gemini's chat API differs from the OpenAI shape; this flattens
        # the conversation to the latest user message
        response = await self.model.generate_content_async(
            messages[-1]["content"]
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response.text,
            model=self.model_name,
            provider="gemini",
            tokens_used=0,  # usage metadata not surfaced here
            latency_ms=latency
        )
```

The LLM Router

The router picks the best provider based on task type, latency requirements, and availability:

```python
import logging

logger = logging.getLogger(__name__)

class LLMRouter:
    def __init__(self, providers: dict[str, LLMProvider]):
        self.providers = providers
        self.fallback_order = ["openai", "groq", "ollama", "gemini"]

    async def route(self, messages, task_type="general", tools=None):
        provider_name = self._select_provider(task_type)

        for name in self._get_fallback_chain(provider_name):
            try:
                provider = self.providers[name]
                return await provider.complete(messages, tools)
            except Exception as e:
                logger.warning(f"{name} failed: {e}, trying next provider")
                continue

        raise RuntimeError("All providers failed")

    def _select_provider(self, task_type):
        routing_rules = {
            "reasoning": "openai",
            "realtime": "groq",
            "local": "ollama",
            "vision": "gemini",
            "general": "openai"
        }
        return routing_rules.get(task_type, "openai")

    def _get_fallback_chain(self, primary):
        chain = [primary]
        for name in self.fallback_order:
            if name != primary:
                chain.append(name)
        return chain
```
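The fallback behaviour is easy to exercise in isolation. Here is a minimal sketch with stub providers (`FlakyProvider` and `StubProvider` are test doubles I made up, not part of the framework) showing that a failing primary hands off to the next provider in the chain:

```python
import asyncio

class FlakyProvider:
    """Stub that always raises, simulating an outage."""
    async def complete(self, messages, tools=None):
        raise RuntimeError("503 Service Unavailable")

class StubProvider:
    """Stub that always succeeds, standing in for a healthy backend."""
    def __init__(self, name):
        self.name = name

    async def complete(self, messages, tools=None):
        return f"answered by {self.name}"

async def route_with_fallback(providers, chain, messages):
    # Mirrors LLMRouter.route: try each provider in order, return the
    # first success, raise only when every provider has failed.
    for name in chain:
        try:
            return await providers[name].complete(messages)
        except Exception:
            continue
    raise RuntimeError("All providers failed")

providers = {"openai": FlakyProvider(), "groq": StubProvider("groq")}
result = asyncio.run(route_with_fallback(providers, ["openai", "groq"], []))
print(result)  # answered by groq
```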

Building the Agent

With the router in place, the agent itself is straightforward:

```python
class Agent:
    def __init__(self, router: LLMRouter, tools: ToolRegistry, memory: Memory):
        self.router = router
        self.tools = tools
        self.memory = memory

    async def execute(self, task: str) -> str:
        context = await self.memory.get_relevant(task)

        messages = [
            {"role": "system", "content": self._build_system_prompt(context)},
            {"role": "user", "content": task}
        ]

        while True:
            response = await self.router.route(
                messages,
                task_type=self._classify_task(task),
                tools=self.tools.get_schemas()
            )

            if not response.tool_calls:
                break

            # The assistant turn that requested the tools must precede
            # the tool results in the conversation history
            messages.append({
                "role": "assistant",
                "content": response.content,
                "tool_calls": response.tool_calls
            })
            tool_results = await self.tools.execute(response.tool_calls)
            messages.extend(tool_results)

        await self.memory.store(task, response.content)
        return response.content
```
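The agent calls `_classify_task` and `_build_system_prompt`, which I have not shown. A full implementation could itself use a cheap LLM call, but even a keyword heuristic works surprisingly well as a starting point. This is a hypothetical sketch, not the framework's actual classifier:

```python
def classify_task(task: str) -> str:
    """Naive keyword-based task classifier. Returns one of the
    task types the router's routing_rules understands."""
    lowered = task.lower()
    if any(kw in lowered for kw in ("image", "photo", "screenshot", "diagram")):
        return "vision"
    if any(kw in lowered for kw in ("analyze", "plan", "debug", "prove", "why")):
        return "reasoning"
    if any(kw in lowered for kw in ("quick", "summarize", "translate")):
        return "realtime"
    return "general"

print(classify_task("Analyze the performance bottlenecks in our API"))  # reasoning
print(classify_task("What does this screenshot show?"))                 # vision
```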

Tool Registry

Tools give the agent the ability to interact with external systems:

```python
class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name: str, func, schema: dict):
        self._tools[name] = {"func": func, "schema": schema}

    async def execute(self, tool_calls):
        results = []
        for call in tool_calls:
            tool = self._tools[call.name]
            result = await tool["func"](**call.arguments)
            results.append({
                "role": "tool",
                "content": str(result),
                "tool_call_id": call.id
            })
        return results

    @classmethod
    def default(cls):
        registry = cls()
        registry.register("web_search", web_search, web_search_schema)
        registry.register("code_execute", code_execute, code_execute_schema)
        registry.register("file_read", file_read, file_read_schema)
        return registry
```
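The `web_search`, `code_execute`, and `file_read` tools and their schemas are left as an exercise. To make the shape concrete, here is a self-contained example with a made-up `word_count` tool, using the OpenAI-style function schema the registry expects (the tool itself is illustrative, not part of the framework):

```python
import asyncio

# Hypothetical tool: an async function plus a function-calling schema
async def word_count(text: str) -> int:
    return len(text.split())

word_count_schema = {
    "type": "function",
    "function": {
        "name": "word_count",
        "description": "Count the words in a piece of text",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}

# Minimal stand-in for ToolRegistry.register / lookup
tools = {}
tools["word_count"] = {"func": word_count, "schema": word_count_schema}

result = asyncio.run(tools["word_count"]["func"](text="multi model agents are fun"))
print(result)  # 5
```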

Putting It All Together

```python
providers = {
    "openai": OpenAIProvider("gpt-4"),
    "ollama": OllamaProvider("llama3"),
    "groq": GroqProvider("llama3-70b-8192"),
    "gemini": GeminiProvider("gemini-pro")
}

router = LLMRouter(providers)
tools = ToolRegistry.default()
memory = RedisMemory(url="redis://localhost:6379")

agent = Agent(router=router, tools=tools, memory=memory)

# Run from inside an async entrypoint (e.g. asyncio.run(main()))
result = await agent.execute(
    "Analyze the performance bottlenecks in our API and suggest fixes"
)
```
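`RedisMemory` is not shown in this article; any object with `get_relevant(task)` and `store(task, result)` will do. As a sketch of that interface, here is a dictionary-backed stand-in with a deliberately naive word-overlap notion of relevance (a real implementation would use the `embed` method and vector similarity):

```python
import asyncio

class DictMemory:
    """In-memory stand-in for RedisMemory, exposing the same interface
    the Agent relies on: get_relevant() and store()."""
    def __init__(self):
        self._entries: list[tuple[str, str]] = []

    async def get_relevant(self, task: str, limit: int = 3) -> list[str]:
        # Naive relevance: past results whose task shares any word with
        # the new task. Real systems would rank by embedding similarity.
        words = set(task.lower().split())
        hits = [r for t, r in self._entries if words & set(t.lower().split())]
        return hits[:limit]

    async def store(self, task: str, result: str) -> None:
        self._entries.append((task, result))

mem = DictMemory()
asyncio.run(mem.store("API latency audit", "p99 was 800ms"))
hits = asyncio.run(mem.get_relevant("audit the API"))
print(hits)  # ['p99 was 800ms']
```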

Cost Optimization

One of the biggest benefits of multi-model routing is cost control. Here is a practical routing strategy:

| Task type         | Provider       | Cost per 1M tokens |
|-------------------|----------------|--------------------|
| Complex reasoning | OpenAI GPT-4   | $30                |
| Simple Q&A        | Groq LLaMA 3   | $0.59              |
| Code generation   | Ollama (local) | $0                 |
| Image analysis    | Gemini Pro     | $0.50              |
By routing 70% of requests to Groq/Ollama and only using GPT-4 for complex tasks, we reduced our monthly AI costs by 80%.
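You can sanity-check savings like this with back-of-envelope arithmetic. The helper below uses the per-token prices from the table above; the traffic mix and token volume are illustrative, and the exact savings depend heavily on your own mix:

```python
# Illustrative prices per 1M tokens, taken from the table above
PRICE_PER_M = {"openai": 30.0, "groq": 0.59, "ollama": 0.0}

def monthly_cost(total_tokens_m: float, mix: dict[str, float]) -> float:
    """Cost of total_tokens_m million tokens, given a provider->share
    traffic mix (shares should sum to 1)."""
    return sum(total_tokens_m * share * PRICE_PER_M[p] for p, share in mix.items())

# 100M tokens/month, everything on GPT-4 vs. a routed mix
baseline = monthly_cost(100, {"openai": 1.0})
routed = monthly_cost(100, {"openai": 0.3, "groq": 0.4, "ollama": 0.3})
print(f"${baseline:.0f} -> ${routed:.0f}, saving {1 - routed / baseline:.0%}")
```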

What I Learned

Building this framework taught me a few things:

  1. Provider abstraction pays off fast. The moment one API has an outage, your system keeps running.
  2. Latency varies wildly. Groq at 200ms vs OpenAI at 1-2s makes a real difference for interactive applications.
  3. Local models are underrated. Ollama with LLaMA 3 handles 80% of tasks without any API calls.
  4. Memory is the hard part. Deciding what to remember and what to forget matters more than which model you use.

The full source code is available on GitHub: ai-agent-framework


If you are building AI agents or working with multiple LLM providers, I would love to hear about your approach. Drop a comment below or connect with me on GitHub.
