Building a Provider-Agnostic LLM Abstraction Layer: Benchmarking OpenAI, Gemini, Groq, DeepSeek and Ollama
1. Introduction – Why Multi-LLM Architecture Matters
While experimenting with multiple LLM providers like OpenAI, Gemini, Groq and local models via Ollama, I noticed a recurring architectural concern: tight coupling to a single provider SDK creates long-term rigidity.
A provider-agnostic abstraction layer lets you optimize for cost and performance at runtime without code changes.
2. Why do we need to solve this problem now?
The LLM ecosystem is evolving rapidly. New providers are emerging, model capabilities are improving monthly, and pricing and latency characteristics vary significantly across platforms. A company committed to a single provider (OpenAI) gets locked in: if pricing increases 3x or a competitor releases a 10x faster model, switching costs are prohibitive. At scale, switching providers can save 30-50% on costs - but only if your architecture allows it. Without an abstraction layer, provider switching requires weeks of refactoring. With it, it's a configuration change. In 2026 this abstraction layer went from nice-to-have to essential infrastructure.
Privacy and compliance are another reason to switch providers or to run models locally, with the tradeoffs of hardware constraints and potentially weaker reasoning.
The abstraction layer proposed here serves the goal of addressing the above challenges. It guarantees a standard contract irrespective of the provider and model. The current scope does not include streaming and tool calling.
2.1 What are we solving vs. what we are not
✅ Non-streamed text generation across 6+ providers
✅ Latency and cost benchmarking
✅ Provider-agnostic interface with factory pattern
❌ Streaming responses
❌ Tool calling and function calling
❌ Rate limit handling and circuit breaker
3. Problem Statement
3.1 Vendor Lock-In
Directly embedding provider SDK calls into application logic creates tight coupling. Switching providers later requires widespread refactoring.
3.2 Inconsistent API Interfaces
Each provider uses different request structures, parameter names, and response formats. Streaming behavior and token accounting also differ.
3.3 Operational Differences
Latency, rate limits, cost per token, and reasoning quality vary across providers. An architecture that cannot easily switch providers limits experimentation and optimization.
3.4 Security & Secret Management
Managing multiple API keys, isolating environments, and protecting prompt data becomes more complex in multi-provider setups.
3.5 Local vs Hosted Tradeoffs
Local models via Ollama provide privacy and control but introduce hardware constraints and potentially reduced reasoning capability.
4. Architecture: Designing the Abstraction Layer
4.1 Create your API keys
First, create API keys for the providers you want to test. In my case I created keys for OpenAI, Gemini, Groq, Anthropic and DeepSeek.
For local development you can store them in a .env file. For production use cases, store them in a secrets manager such as AWS Secrets Manager or GCP Secret Manager.
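For local development, loading the .env file can be as simple as the sketch below. This is a minimal stdlib-only loader for illustration; in a real project you would use a library like python-dotenv instead.

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines, '#' comments ignored.
    (Use python-dotenv in real projects.)"""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Never overwrite variables already set in the real environment.
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```

Adapters can then read keys with `os.getenv("OPENAI_API_KEY")` and friends without caring where the value came from.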
4.2 Create a Provider Interface class
```python
from typing import Any, Dict, List, Protocol

class ProviderAdapter(Protocol):
    """Provider-agnostic interface implemented by all provider adapters."""

    def chat(self, messages: List[Dict[str, str]], **kwargs: Any) -> Dict[str, Any]:
        """Send a chat completion request and return a normalized response."""
        ...

    def get_model_name(self) -> str:
        ...

    def get_model_description(self) -> str:
        ...
```
4.3 Create a client class
Next, create a client class with a method that measures usage. Below is a snippet:
```python
def chat_with_usage(
    self,
    user_message: str,
    history: List[Dict[str, str]] | None = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Return the normalized response, including the raw SDK response for usage metrics."""
    if history is None:
        history = []
    messages = history + [{"role": "user", "content": user_message}]
    return self._adapter.chat(messages, **kwargs)
```
A factory then selects the adapter class based on the provider name and wraps it in a client.
```python
def create_llm_client(provider: str) -> LLMClient:
    """Simple factory for creating an LLMClient for a given provider name."""
    provider = provider.lower()
    if provider == "openai":
        adapter = OpenAIAdapter()
    elif provider == "gemini":
        adapter = GeminiAdapter()
    elif provider == "anthropic":
        adapter = AnthropicAdapter()
    elif provider == "deepseek":
        adapter = DeepSeekAdapter()
    elif provider == "ollama":
        adapter = OllamaAdapter()
    elif provider == "groq":
        adapter = GroqAdapter()
    else:
        raise ValueError(f"Unknown provider: {provider}")
    return LLMClient(adapter)
```
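An equivalent, slightly more extensible variant replaces the if/elif chain with a registry dict, so adding a provider is a one-line change. The stub adapter classes below are hypothetical stand-ins for the real ones:

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the real adapter classes.
class OpenAIAdapter: ...
class GeminiAdapter: ...

_ADAPTERS: Dict[str, Callable[[], object]] = {
    "openai": OpenAIAdapter,
    "gemini": GeminiAdapter,
    # one line per additional provider
}

def create_adapter(provider: str) -> object:
    """Registry-based factory; raises ValueError for unknown names."""
    try:
        return _ADAPTERS[provider.lower()]()
    except KeyError:
        raise ValueError(f"Unknown provider: {provider}") from None
```

Usage is identical: `create_adapter("OpenAI")` returns an `OpenAIAdapter` regardless of input casing.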
4.4 Create the adapter class
Below is an example of the Gemini adapter. We can create similar adapters for OpenAI, Groq and DeepSeek. One thing to keep in mind: Anthropic does not support the OpenAI SDK, so its adapter needs Anthropic's own SDK.
```python
import os
from typing import Any, Dict, List

# Gemini exposes an OpenAI-compatible endpoint, so the OpenAI SDK client is reused.
from openai import OpenAI as GeminiClient

class GeminiAdapter:
    """Adapter for Gemini using the OpenAI-compatible endpoint."""

    def __init__(self, model: str = "gemini-2.5-flash"):
        self._client = GeminiClient(
            api_key=os.getenv("GOOGLE_API_KEY"),
            base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        )
        self._model = model

    def chat(self, messages: List[Dict[str, str]], **kwargs: Any) -> Dict[str, Any]:
        response = self._client.chat.completions.create(
            model=self._model,
            messages=messages,
            **kwargs,
        )
        return {
            "text": response.choices[0].message.content,
            "raw": response,
        }

    def get_model_name(self) -> str:
        return self._model

    def get_model_description(self) -> str:
        return "Gemini 2.5 Flash, fast general-purpose model"
```
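The local adapter is the other interesting case. A sketch using only the standard library against Ollama's native `/api/chat` endpoint might look like this; it assumes an Ollama server is running on the default port 11434, and the model name mirrors the one used in the benchmark:

```python
import json
import urllib.request
from typing import Any, Dict, List

class OllamaAdapter:
    """Adapter for a local Ollama server via its native /api/chat endpoint.
    Assumes Ollama is running at http://localhost:11434."""

    def __init__(self, model: str = "lfm2.5-thinking", host: str = "http://localhost:11434"):
        self._model = model
        self._url = f"{host}/api/chat"

    def chat(self, messages: List[Dict[str, str]], **kwargs: Any) -> Dict[str, Any]:
        payload = json.dumps(
            {"model": self._model, "messages": messages, "stream": False, **kwargs}
        ).encode()
        req = urllib.request.Request(
            self._url, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            body = json.loads(resp.read())
        # Normalize to the same {"text", "raw"} shape as the hosted adapters.
        return {"text": body["message"]["content"], "raw": body}

    def get_model_name(self) -> str:
        return self._model

    def get_model_description(self) -> str:
        return "Local model served by Ollama; no per-token cost"
```

Because it returns the same normalized dict, the client and factory need no changes to support it.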
4.5 Create a class to calculate pricing and latency for benchmarking
```python
def calculate_cost(self, provider_key: str, input_tokens: int, output_tokens: int) -> float:
    pricing = self.pricing_table[provider_key]
    # Pricing entries may be quoted per 1K or per 1M tokens.
    if "input_per_1k" in pricing:
        denom = 1000.0
        input_price = pricing["input_per_1k"]
        output_price = pricing["output_per_1k"]
    else:
        denom = 1_000_000.0
        input_price = pricing["input_per_1M"]
        output_price = pricing["output_per_1M"]
    input_cost = (input_tokens / denom) * input_price
    output_cost = (output_tokens / denom) * output_price
    return round(input_cost + output_cost, 6)
```
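To sanity-check the pricing math, here is a standalone version of the same calculation fed with the DeepSeek pricing and the median token counts from the benchmark table later in this article. It reproduces the $0.000594 per-request estimate shown there:

```python
# Per-1M-token pricing, matching the hard-coded table in this article.
PRICING = {
    "deepseek:deepseek-reasoner": {"input_per_1M": 0.28, "output_per_1M": 0.42},
}

def calculate_cost(provider_key: str, input_tokens: int, output_tokens: int) -> float:
    """Standalone per-request cost estimate for per-1M-token pricing."""
    p = PRICING[provider_key]
    cost = (input_tokens / 1_000_000) * p["input_per_1M"] \
         + (output_tokens / 1_000_000) * p["output_per_1M"]
    return round(cost, 6)

# Median benchmark run: 30 input tokens, 1395 output tokens.
request_cost = calculate_cost("deepseek:deepseek-reasoner", 30, 1395)
```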
```python
def measure_call(
    self,
    provider_key: str,
    client: LLMClient,
    user_message: str,
    history: List[Dict[str, str]] | None = None,
    **kwargs: Any,
) -> LLMCallResult:
    """Call the LLM client, measuring latency and computing cost from usage tokens."""
    start = time.time()
    result = client.chat_with_usage(user_message, history=history, **kwargs)
    end = time.time()
    response = result["raw"]
    latency_ms = (end - start) * 1000.0
    usage = getattr(response, "usage", None)
    if usage is None:
        input_tokens = 0
        output_tokens = 0
        cost_usd = 0.0
    else:
        input_tokens = usage.prompt_tokens
        output_tokens = usage.completion_tokens
        cost_usd = self.calculate_cost(provider_key, input_tokens, output_tokens)
    # LLMCallResult field names assumed; adjust to your dataclass definition.
    return LLMCallResult(
        text=result["text"],
        latency_ms=latency_ms,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost_usd=cost_usd,
    )
```
4.6 Add pricing information
Providers do not expose pricing through public APIs, likely because prices change frequently and models go out of support.
I hard-coded the pricing from different sources and pinned each entry to a last_updated date. Here is an example for the DeepSeek reasoner model:
```
"deepseek:deepseek-reasoner": {
    "input_per_1M": 0.28,
    "output_per_1M": 0.42,
    "last_updated": "2026-02-25",
}
```
The token counts and costs are approximate across providers due to tokenizer and pricing differences.
5. Benchmark Observations: Latency, Tokens, and Cost
To evaluate providers objectively, I captured the following metrics per request:
- Total latency (milliseconds)
- Input tokens
- Output tokens
- Estimated cost per request
- Cost per 1K output tokens
- Latency per 1K output tokens
I used the prompt below for all models. The latency values shown in the table are P95 and the token usage values are the median of 10 runs with this prompt. Benchmark costs are based on the pinned pricing version described above.
> In what ways might the concept of 'free will' intersect with advancements in artificial intelligence, particularly concerning decision-making and moral responsibility?
| Provider | Model | Latency (ms) - P95 | Input Tokens (median) | Output Tokens (median) | Est. Cost / Request | Cost / 1K output tokens | Latency/ 1K output tokens (ms) |
|---|---|---|---|---|---|---|---|
| OpenAI | openai:gpt-4o-mini | 10395.56 | 33 | 661 | $0.000402 | $0.000608 | 15,725 |
| Gemini | gemini:gemini-2.5-flash | 16592.97 | 28 | 1634 | $0.004093 | $0.002506 | 10,154 |
| Anthropic | anthropic:claude-sonnet-4-5 | 16633.02 | 38 | 364 | $0.005574 | $0.015319 | 45,695 |
| Deepseek | deepseek:deepseek-reasoner | 36220.81 | 30 | 1395 | $0.000594 | $0.000426 | 25,964 |
| Groq | qwen/qwen3-32b | 6363.64 | 34 | 1921 | $0.000464 | $0.000242 | 3,312 |
| Ollama (Local) | ollama:lfm2.5-thinking | 17409.46 | 35 | 958 | Hardware Cost Only | — | 18,172 |
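For reproducibility, the P95 and median aggregates in the table can be computed from raw run samples like this. With only 10 runs per provider, a simple nearest-rank percentile is adequate:

```python
import math
from statistics import median

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile (adequate for small sample counts)."""
    ranked = sorted(samples)
    rank = math.ceil(0.95 * len(ranked))  # 1-based nearest-rank index
    return ranked[rank - 1]

# Hypothetical latencies (ms) for one provider's 10 runs:
latencies_ms = [5800.0, 5900.0, 5950.0, 6050.0, 6100.0,
                6150.0, 6200.0, 6250.0, 6300.0, 6363.64]
p95_latency = p95(latencies_ms)
median_latency = median(latencies_ms)
```

Note that with n=10, the nearest-rank P95 is simply the worst observed run, which is why more runs give a more stable tail estimate.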
5.1 Key Observations
DeepSeek's tradeoff is clear: very low per-request cost but very high latency, consistent with its positioning as a reasoning model - potentially better for offline/batch reasoning.
Groq was the best cost+speed option in these runs: lowest latency (~6.4s) while producing the longest output (1921 tokens) at a very low per-request cost ($0.000464). For real-time applications (chatbots, APIs), Groq could be an ideal choice.
At 1M requests/month, Groq is cheaper than OpenAI by cost per output token but outputs 2.9x more tokens. Claude Sonnet outputs half the tokens but costs 12x more. Provider selection should factor in output verbosity.
Claude prioritizes depth over efficiency: While it produced the shortest response (364 tokens), it has the highest per-token cost ($0.0153/k) and slowest throughput (45.7s per 1k tokens), suggesting it's best reserved for complex reasoning tasks where quality justifies the cost.
Token variance: Gemini outputs ~4.5x more tokens (1634) than Anthropic (364) for the same prompt, despite having similar latency. For cost-sensitive or token-limited applications, Anthropic may be more predictable. The framework should monitor output length per provider as a quality signal.
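The observations above reduce to simple arithmetic over the table, which is worth making explicit because per-request and per-token rankings disagree. The helper below uses the Groq and OpenAI figures from the benchmark; the 1M requests/month volume is the illustrative assumption from the text:

```python
def cost_per_1k_output(cost_per_request: float, output_tokens: int) -> float:
    """Normalize a per-request cost to cost per 1K output tokens."""
    return cost_per_request / output_tokens * 1000

def monthly_cost(cost_per_request: float, requests_per_month: int = 1_000_000) -> float:
    """Back-of-envelope monthly bill at a fixed request volume."""
    return cost_per_request * requests_per_month

# From the table: Groq $0.000464/request at 1921 output tokens,
# OpenAI $0.000402/request at 661 output tokens.
groq_per_1k = cost_per_1k_output(0.000464, 1921)      # ≈ $0.000242
openai_per_1k = cost_per_1k_output(0.000402, 661)     # ≈ $0.000608
```

Per request, OpenAI is actually slightly cheaper ($402 vs $464 per month at 1M requests); Groq only wins once you normalize by the 2.9x larger outputs it produces, which is exactly why verbosity belongs in the selection criteria.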
5.2 How to use the above data for actual decisions
- Groq (qwen3-32b) : Real-time chatbots, high volume APIs, cost sensitive workloads.
- Gemini (2.5 flash) : Balanced latency, cost and output quality. Good for general purpose tasks with flexible output length tolerance
- OpenAI (gpt-4o-mini) : Mid-cost, moderate reasoning
- Claude Sonnet : Complex reasoning tasks, code generation, high-quality outputs where token cost is acceptable
- DeepSeek (reasoner) : Batch processing, offline reasoning tasks. Avoid for latency sensitive use cases.
- Local (Ollama) : Privacy/compliance requirements and no API costs, but limited reasoning.
6. What's Next
I am working on enhancing the framework with the following capabilities -
- Intelligent routing (manual to rules-based) - Route requests to the cheapest provider that meets a latency SLA. This can yield 30-50% cost reduction at scale.
- Fallback and retry logic - Add exponential backoff and provider failover. For example, if Groq times out, automatically retry with Gemini. This improves reliability.
- Streaming and tool calling - The current scope is limited to non-streamed text only. Extend to streaming responses and structured output.
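As a preview of the rules-based routing idea, here is a minimal sketch: pick the cheapest provider whose P95 latency meets the caller's SLA. The stats table is seeded from the benchmark numbers in this article; in the real framework these would be refreshed from live measurements:

```python
from typing import Optional

# Seeded from the benchmark table; real values would come from live stats.
PROVIDER_STATS = [
    {"name": "groq",     "p95_latency_ms": 6364,  "cost_per_request": 0.000464},
    {"name": "openai",   "p95_latency_ms": 10396, "cost_per_request": 0.000402},
    {"name": "deepseek", "p95_latency_ms": 36221, "cost_per_request": 0.000594},
]

def route(max_latency_ms: float) -> Optional[str]:
    """Cheapest provider whose P95 latency meets the SLA; None if none qualify."""
    candidates = [p for p in PROVIDER_STATS if p["p95_latency_ms"] <= max_latency_ms]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p["cost_per_request"])["name"]
```

Note how the SLA changes the answer: a tight real-time budget forces Groq, while a relaxed one lets the cheaper-per-request OpenAI win.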
Hope you find this approach useful and relevant while choosing and switching between models. Let me know your feedback in the comments.