Most AI applications today rely on a single LLM provider. That works fine until the API goes down, rate limits hit, or your costs spiral out of control. A better approach is to build agents that can orchestrate multiple models and switch between them based on the task at hand.
In this article, I will walk through how I built an AI agent framework that treats OpenAI's GPT-4, local models via Ollama, Groq's ultra-fast inference, and Google Gemini as interchangeable backends.
## Why Multi-Model?
Each provider has different strengths:
- OpenAI GPT-4 offers strong reasoning and reliable function calling
- Ollama runs models locally, with no network latency and no API costs
- Groq delivers sub-200ms inference for real-time applications
- Gemini excels at multimodal tasks (vision, audio, code)
By abstracting the provider layer, your agent can pick the right model for each subtask, fall back gracefully when one provider fails, and optimize cost by routing simple tasks to cheaper models.
## Architecture Overview
The framework has four main components:
```
Agent Core -> Planning -> Tool Execution -> Memory
     |           |              |              |
LLM Router   Task Graph     Registry    Redis/PostgreSQL
     |
OpenAI | Ollama | Groq | Gemini
```
The LLM Router is the key piece. It decides which provider handles each request based on configurable rules.
## Setting Up the Provider Layer
First, define a common interface that all providers implement:
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class LLMResponse:
    content: str
    model: str
    provider: str
    tokens_used: int
    latency_ms: float
    tool_calls: list = field(default_factory=list)

    @property
    def has_tool_calls(self) -> bool:
        return bool(self.tool_calls)


class LLMProvider(ABC):
    @abstractmethod
    async def complete(self, messages: list, tools: list | None = None) -> LLMResponse:
        ...

    async def embed(self, text: str) -> list[float]:
        # Optional capability: providers without an embeddings
        # endpoint can leave this unimplemented
        raise NotImplementedError
```
Then implement each provider:
```python
import os
import time

import openai
import ollama
import google.generativeai as genai
from groq import AsyncGroq


class OpenAIProvider(LLMProvider):
    def __init__(self, model="gpt-4"):
        self.client = openai.AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        kwargs = {"model": self.model, "messages": messages}
        if tools:  # the API rejects tools=None, so only pass it when set
            kwargs["tools"] = tools
        response = await self.client.chat.completions.create(**kwargs)
        latency = (time.monotonic() - start) * 1000
        message = response.choices[0].message
        return LLMResponse(
            content=message.content or "",
            model=self.model,
            provider="openai",
            tokens_used=response.usage.total_tokens,
            latency_ms=latency,
            tool_calls=message.tool_calls or [],
        )


class OllamaProvider(LLMProvider):
    def __init__(self, model="llama3"):
        self.model = model

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        response = await ollama.AsyncClient().chat(
            model=self.model,
            messages=messages,
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response["message"]["content"],
            model=self.model,
            provider="ollama",
            tokens_used=response.get("eval_count", 0),
            latency_ms=latency,
        )


class GroqProvider(LLMProvider):
    def __init__(self, model="llama3-70b-8192"):
        self.client = AsyncGroq()  # async client, so complete() doesn't block the event loop
        self.model = model

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response.choices[0].message.content,
            model=self.model,
            provider="groq",
            tokens_used=response.usage.total_tokens,
            latency_ms=latency,
        )


class GeminiProvider(LLMProvider):
    def __init__(self, model="gemini-pro"):
        genai.configure(api_key=os.environ["GEMINI_API_KEY"])
        self.model_name = model
        self.model = genai.GenerativeModel(model)

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        # Simplification: only the latest user message is sent; a full
        # implementation would map the whole history to Gemini's format
        response = await self.model.generate_content_async(
            messages[-1]["content"]
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response.text,
            model=self.model_name,
            provider="gemini",
            tokens_used=0,  # token usage isn't surfaced here
            latency_ms=latency,
        )
```
## The LLM Router
The router picks the best provider based on task type, latency requirements, and availability:
```python
import logging

logger = logging.getLogger(__name__)


class LLMRouter:
    def __init__(self, providers: dict[str, LLMProvider]):
        self.providers = providers
        self.fallback_order = ["openai", "groq", "ollama", "gemini"]

    async def route(self, messages, task_type="general", tools=None):
        provider_name = self._select_provider(task_type)
        for name in self._get_fallback_chain(provider_name):
            try:
                provider = self.providers[name]
                return await provider.complete(messages, tools)
            except Exception as e:
                logger.warning(f"{name} failed: {e}, trying next provider")
                continue
        raise RuntimeError("All providers failed")

    def _select_provider(self, task_type):
        routing_rules = {
            "reasoning": "openai",
            "realtime": "groq",
            "local": "ollama",
            "vision": "gemini",
            "general": "openai",
        }
        return routing_rules.get(task_type, "openai")

    def _get_fallback_chain(self, primary):
        chain = [primary]
        for name in self.fallback_order:
            if name != primary:
                chain.append(name)
        return chain
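A quick way to sanity-check the failover behaviour is to run the routing loop against stub providers, one of which always fails. This is a minimal standalone sketch mirroring `LLMRouter.route` (the stub classes are illustrative, not part of the framework):

```python
import asyncio


class FailingProvider:
    async def complete(self, messages, tools=None):
        raise ConnectionError("simulated outage")


class EchoProvider:
    def __init__(self, name):
        self.name = name

    async def complete(self, messages, tools=None):
        return f"{self.name}: {messages[-1]['content']}"


# Minimal routing loop: try each provider in the chain, skip failures
async def route(providers, chain, messages):
    for name in chain:
        try:
            return await providers[name].complete(messages)
        except Exception:
            continue
    raise RuntimeError("All providers failed")


providers = {"openai": FailingProvider(), "groq": EchoProvider("groq")}
result = asyncio.run(
    route(providers, ["openai", "groq"], [{"role": "user", "content": "ping"}])
)
print(result)  # groq: ping
```

The "openai" stub raises, so the request silently falls through to "groq", which is exactly the behaviour you want during a real outage.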
## Building the Agent
With the router in place, the agent itself is straightforward:
```python
class Agent:
    def __init__(self, router: LLMRouter, tools: ToolRegistry, memory: Memory):
        self.router = router
        self.tools = tools
        self.memory = memory

    async def execute(self, task: str) -> str:
        context = await self.memory.get_relevant(task)
        messages = [
            {"role": "system", "content": self._build_system_prompt(context)},
            {"role": "user", "content": task},
        ]
        while True:
            response = await self.router.route(
                messages,
                task_type=self._classify_task(task),
                tools=self.tools.get_schemas(),
            )
            if not response.has_tool_calls:
                break
            # Echo the assistant's tool calls into the history so the model
            # can see them alongside the tool results on the next turn
            messages.append({
                "role": "assistant",
                "content": response.content,
                "tool_calls": response.tool_calls,
            })
            tool_results = await self.tools.execute(response.tool_calls)
            messages.extend(tool_results)
        await self.memory.store(task, response.content)
        return response.content
```
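The `_classify_task` helper isn't shown above; a minimal keyword-based sketch might look like the following (the keywords are my own guess at a reasonable heuristic, and a production version could instead ask a cheap model to classify):

```python
def classify_task(task: str) -> str:
    # Crude keyword heuristic mapping a task to a routing category;
    # the keyword lists here are purely illustrative
    task_lower = task.lower()
    if any(word in task_lower for word in ("image", "photo", "screenshot")):
        return "vision"
    if any(word in task_lower for word in ("analyze", "debug", "plan", "refactor")):
        return "reasoning"
    if any(word in task_lower for word in ("chat", "quick", "summarize")):
        return "realtime"
    return "general"


print(classify_task("Analyze the performance bottlenecks in our API"))  # reasoning
```

Anything the heuristic can't place falls through to "general", which the router sends to the default provider.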
## Tool Registry
Tools give the agent the ability to interact with external systems:
```python
class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name: str, func, schema: dict):
        self._tools[name] = {"func": func, "schema": schema}

    def get_schemas(self) -> list[dict]:
        return [tool["schema"] for tool in self._tools.values()]

    async def execute(self, tool_calls):
        # Assumes tool calls are normalized to objects with .name,
        # .arguments (a dict), and .id before reaching the registry
        results = []
        for call in tool_calls:
            tool = self._tools[call.name]
            result = await tool["func"](**call.arguments)
            results.append({
                "role": "tool",
                "content": str(result),
                "tool_call_id": call.id,
            })
        return results

    @classmethod
    def default(cls):
        registry = cls()
        registry.register("web_search", web_search, web_search_schema)
        registry.register("code_execute", code_execute, code_execute_schema)
        registry.register("file_read", file_read, file_read_schema)
        return registry
```
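Registering your own tool is just a function plus a schema. Here is a hypothetical example (the tool name, schema, and stubbed behaviour are all illustrative; the schema shape follows OpenAI's function-calling convention):

```python
import asyncio


# Hypothetical tool: a real implementation would call a weather API
async def get_weather(city: str) -> str:
    return f"Weather in {city}: 21C, clear"


get_weather_schema = {
    "type": "function",  # OpenAI-style function-calling schema
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# registry.register("get_weather", get_weather, get_weather_schema)
print(asyncio.run(get_weather("Oslo")))  # Weather in Oslo: 21C, clear
```

The schema is what the router hands to the model via `tools=`, while the function is what `execute` invokes when the model calls it.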
## Putting It All Together
```python
import asyncio


async def main():
    providers = {
        "openai": OpenAIProvider("gpt-4"),
        "ollama": OllamaProvider("llama3"),
        "groq": GroqProvider("llama3-70b-8192"),
        "gemini": GeminiProvider("gemini-pro"),
    }
    router = LLMRouter(providers)
    tools = ToolRegistry.default()
    memory = RedisMemory(url="redis://localhost:6379")
    agent = Agent(router=router, tools=tools, memory=memory)

    result = await agent.execute(
        "Analyze the performance bottlenecks in our API and suggest fixes"
    )
    print(result)


asyncio.run(main())
```
## Cost Optimization
One of the biggest benefits of multi-model routing is cost control. Here is a practical routing strategy:
| Task Type | Provider | Cost per 1M tokens (approx.) |
|---|---|---|
| Complex reasoning | OpenAI GPT-4 | $30 |
| Simple Q&A | Groq LLaMA 3 | $0.59 |
| Code generation | Ollama (local) | $0 |
| Image analysis | Gemini Pro | $0.50 |
By routing 70% of requests to Groq/Ollama and only using GPT-4 for complex tasks, we reduced our monthly AI costs by 80%.
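The arithmetic behind that is easy to sanity-check with a blended-cost calculation. The sketch below uses the rates from the table and an illustrative 70/30 traffic split; note that a naive blend over request counts understates the real savings when the cheap tasks also consume fewer GPT-4 tokens:

```python
def blended_cost(split: dict[str, float], rates: dict[str, float]) -> float:
    """Effective cost per 1M tokens for a traffic split across providers."""
    return sum(share * rates[name] for name, share in split.items())


# Rates per 1M tokens, taken from the table above
rates = {"gpt4": 30.0, "groq": 0.59, "ollama": 0.0}

baseline = blended_cost({"gpt4": 1.0}, rates)  # everything on GPT-4
routed = blended_cost({"gpt4": 0.3, "groq": 0.35, "ollama": 0.35}, rates)

savings = 1 - routed / baseline
print(f"${routed:.2f} per 1M tokens, {savings:.0%} cheaper")  # $9.21 per 1M tokens, 69% cheaper
```

Even this conservative blend cuts the per-token cost by about two thirds; weighting by actual token consumption per task type pushes the figure higher.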
## What I Learned
Building this framework taught me a few things:
- Provider abstraction pays off fast. The moment one API has an outage, your system keeps running.
- Latency varies wildly. Groq at 200ms vs OpenAI at 1-2s makes a real difference for interactive applications.
- Local models are underrated. Ollama with LLaMA 3 handles 80% of tasks without any API calls.
- Memory is the hard part. Deciding what to remember and what to forget matters more than which model you use.
The full source code is available on GitHub: ai-agent-framework
If you are building AI agents or working with multiple LLM providers, I would love to hear about your approach. Drop a comment below or connect with me on GitHub.