Building a Provider-Agnostic LLM Abstraction Layer: Benchmarking OpenAI, Gemini, Groq, DeepSeek and Ollama
1. Introduction – Why Multi-LLM Architecture Matters
While experimenting with multiple LLM providers like OpenAI, Gemini, Groq and local models via Ollama, I noticed a recurring architectural concern: tight coupling to a single provider SDK creates long-term rigidity.
A provider-agnostic abstraction layer lets you optimize for cost and performance at runtime without code changes.
2. Why do we need to solve this problem now?
The LLM ecosystem is evolving rapidly. New providers are emerging, model capabilities are improving monthly, and pricing and latency characteristics vary significantly across platforms. A company committed to a single provider (OpenAI) gets locked in: if pricing increases 3x or a competitor releases a 10x faster model, switching costs are prohibitive. At scale, switching providers can save 30-50% on costs - but only if your architecture allows it. Without an abstraction layer, provider switching requires weeks of refactoring. With it, it's a configuration change. In 2026 this abstraction layer went from nice-to-have to essential infrastructure.
Privacy and compliance are another reason to switch providers or to run models locally, with the tradeoffs of hardware constraints and potentially weaker reasoning.
The abstraction layer proposed here serves the goal of addressing the above challenges. It guarantees a standard contract irrespective of the provider and model. The current scope does not include streaming and tool calling.
2.1 What are we solving vs. what we are not
✅ Non-streamed text generation across 6+ providers
✅ Latency and cost benchmarking
✅ Provider-agnostic interface with factory pattern
❌ Streaming responses
❌ Tool calling and function calling
❌ Rate limit handling and circuit breaker
3. Problem Statement
3.1 Vendor Lock-In
Directly embedding provider SDK calls into application logic creates tight coupling. Switching providers later requires widespread refactoring.
3.2 Inconsistent API Interfaces
Each provider uses different request structures, parameter names, and response formats. Streaming behavior and token accounting also differ.
3.3 Operational Differences
Latency, rate limits, cost per token, and reasoning quality vary across providers. An architecture that cannot easily switch providers limits experimentation and optimization.
3.4 Security & Secret Management
Managing multiple API keys, isolating environments, and protecting prompt data becomes more complex in multi-provider setups.
3.5 Local vs Hosted Tradeoffs
Local models via Ollama provide privacy and control but introduce hardware constraints and potentially reduced reasoning capability.
4. Architecture: Designing the Abstraction Layer
4.1 Create your API keys
First, create API keys for the providers you want to test. In my case I created keys for OpenAI, Gemini, Groq, Anthropic and DeepSeek.
For local development you can store them in a .env file. For production use cases, store them in a secrets manager such as AWS Secrets Manager or GCP Secret Manager.
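For local development, loading the .env file can be as simple as the sketch below. This is a minimal stdlib-only loader for illustration; in a real project you would use a library like python-dotenv instead.

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines, '#' comments ignored.
    (Use python-dotenv in real projects.)"""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Never overwrite variables already set in the real environment.
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```

Adapters can then read keys with `os.getenv("OPENAI_API_KEY")` and friends without caring where the value came from.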
4.2 Create a Provider Interface class
```python
from typing import Any, Dict, List, Protocol

class ProviderAdapter(Protocol):
    """Provider-agnostic interface implemented by all provider adapters."""

    def chat(self, messages: List[Dict[str, str]], **kwargs: Any) -> Dict[str, Any]:
        """Send a chat completion request and return a normalized response."""
        ...

    def get_model_name(self) -> str:
        ...

    def get_model_description(self) -> str:
        ...
```
4.3 Create a client class
Next, create a client class with a method that measures usage. Below is a snippet:
```python
def chat_with_usage(
    self,
    user_message: str,
    history: List[Dict[str, str]] | None = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Return the normalized response, including the raw SDK response for usage metrics."""
    if history is None:
        history = []
    messages = history + [{"role": "user", "content": user_message}]
    return self._adapter.chat(messages, **kwargs)
```
A factory then selects the adapter class based on the provider name and wraps it in a client.
```python
def create_llm_client(provider: str) -> LLMClient:
    """Simple factory for creating an LLMClient for a given provider name."""
    provider = provider.lower()
    if provider == "openai":
        adapter = OpenAIAdapter()
    elif provider == "gemini":
        adapter = GeminiAdapter()
    elif provider == "anthropic":
        adapter = AnthropicAdapter()
    elif provider == "deepseek":
        adapter = DeepSeekAdapter()
    elif provider == "ollama":
        adapter = OllamaAdapter()
    elif provider == "groq":
        adapter = GroqAdapter()
    else:
        raise ValueError(f"Unknown provider: {provider}")
    return LLMClient(adapter)
```
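An equivalent, slightly more extensible variant replaces the if/elif chain with a registry dict, so adding a provider is a one-line change. The stub adapter classes below are hypothetical stand-ins for the real ones:

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the real adapter classes.
class OpenAIAdapter: ...
class GeminiAdapter: ...

_ADAPTERS: Dict[str, Callable[[], object]] = {
    "openai": OpenAIAdapter,
    "gemini": GeminiAdapter,
    # one line per additional provider
}

def create_adapter(provider: str) -> object:
    """Registry-based factory; raises ValueError for unknown names."""
    try:
        return _ADAPTERS[provider.lower()]()
    except KeyError:
        raise ValueError(f"Unknown provider: {provider}") from None
```

Usage is identical: `create_adapter("OpenAI")` returns an `OpenAIAdapter` regardless of input casing.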
4.4 Create the adapter class
Below is an example of the Gemini adapter. We can create similar adapters for OpenAI, Groq and DeepSeek. One thing to keep in mind: Anthropic does not support the OpenAI SDK, so its adapter needs Anthropic's own SDK.
```python
import os
from typing import Any, Dict, List

# Gemini exposes an OpenAI-compatible endpoint, so the OpenAI SDK client is reused.
from openai import OpenAI as GeminiClient

class GeminiAdapter:
    """Adapter for Gemini using the OpenAI-compatible endpoint."""

    def __init__(self, model: str = "gemini-2.5-flash"):
        self._client = GeminiClient(
            api_key=os.getenv("GOOGLE_API_KEY"),
            base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        )
        self._model = model

    def chat(self, messages: List[Dict[str, str]], **kwargs: Any) -> Dict[str, Any]:
        response = self._client.chat.completions.create(
            model=self._model,
            messages=messages,
            **kwargs,
        )
        return {
            "text": response.choices[0].message.content,
            "raw": response,
        }

    def get_model_name(self) -> str:
        return self._model

    def get_model_description(self) -> str:
        return "Gemini 2.5 Flash, fast general-purpose model"
```
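The local adapter is the other interesting case. A sketch using only the standard library against Ollama's native `/api/chat` endpoint might look like this; it assumes an Ollama server is running on the default port 11434, and the model name mirrors the one used in the benchmark:

```python
import json
import urllib.request
from typing import Any, Dict, List

class OllamaAdapter:
    """Adapter for a local Ollama server via its native /api/chat endpoint.
    Assumes Ollama is running at http://localhost:11434."""

    def __init__(self, model: str = "lfm2.5-thinking", host: str = "http://localhost:11434"):
        self._model = model
        self._url = f"{host}/api/chat"

    def chat(self, messages: List[Dict[str, str]], **kwargs: Any) -> Dict[str, Any]:
        payload = json.dumps(
            {"model": self._model, "messages": messages, "stream": False, **kwargs}
        ).encode()
        req = urllib.request.Request(
            self._url, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            body = json.loads(resp.read())
        # Normalize to the same {"text", "raw"} shape as the hosted adapters.
        return {"text": body["message"]["content"], "raw": body}

    def get_model_name(self) -> str:
        return self._model

    def get_model_description(self) -> str:
        return "Local model served by Ollama; no per-token cost"
```

Because it returns the same normalized dict, the client and factory need no changes to support it.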
4.5 Create a class to calculate pricing and latency for benchmarking
```python
def calculate_cost(self, provider_key: str, input_tokens: int, output_tokens: int) -> float:
    pricing = self.pricing_table[provider_key]
    # Pricing entries may be quoted per 1K or per 1M tokens.
    if "input_per_1k" in pricing:
        denom = 1000.0
        input_price = pricing["input_per_1k"]
        output_price = pricing["output_per_1k"]
    else:
        denom = 1_000_000.0
        input_price = pricing["input_per_1M"]
        output_price = pricing["output_per_1M"]
    input_cost = (input_tokens / denom) * input_price
    output_cost = (output_tokens / denom) * output_price
    return round(input_cost + output_cost, 6)
```
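To sanity-check the pricing math, here is a standalone version of the same calculation fed with the DeepSeek pricing and the median token counts from the benchmark table later in this article. It reproduces the $0.000594 per-request estimate shown there:

```python
# Per-1M-token pricing, matching the hard-coded table in this article.
PRICING = {
    "deepseek:deepseek-reasoner": {"input_per_1M": 0.28, "output_per_1M": 0.42},
}

def calculate_cost(provider_key: str, input_tokens: int, output_tokens: int) -> float:
    """Standalone per-request cost estimate for per-1M-token pricing."""
    p = PRICING[provider_key]
    cost = (input_tokens / 1_000_000) * p["input_per_1M"] \
         + (output_tokens / 1_000_000) * p["output_per_1M"]
    return round(cost, 6)

# Median benchmark run: 30 input tokens, 1395 output tokens.
request_cost = calculate_cost("deepseek:deepseek-reasoner", 30, 1395)
```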
```python
def measure_call(
    self,
    provider_key: str,
    client: LLMClient,
    user_message: str,
    history: List[Dict[str, str]] | None = None,
    **kwargs: Any,
) -> LLMCallResult:
    """Call the LLM client, measuring latency and computing cost from usage tokens."""
    start = time.time()
    result = client.chat_with_usage(user_message, history=history, **kwargs)
    end = time.time()
    response = result["raw"]
    latency_ms = (end - start) * 1000.0
    usage = getattr(response, "usage", None)
    if usage is None:
        input_tokens = 0
        output_tokens = 0
        cost_usd = 0.0
    else:
        input_tokens = usage.prompt_tokens
        output_tokens = usage.completion_tokens
        cost_usd = self.calculate_cost(provider_key, input_tokens, output_tokens)
    # LLMCallResult field names assumed; adjust to your dataclass definition.
    return LLMCallResult(
        text=result["text"],
        latency_ms=latency_ms,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost_usd=cost_usd,
    )
```
4.6 Add pricing information
Providers do not expose pricing through public APIs, likely because prices change frequently and models go out of support.
I hard-coded the pricing from different sources and pinned each entry to a last_updated date. Here is an example for the DeepSeek reasoner model:
```
"deepseek:deepseek-reasoner": {
    "input_per_1M": 0.28,
    "output_per_1M": 0.42,
    "last_updated": "2026-02-25",
}
```
The token counts and costs are approximate across providers due to tokenizer and pricing differences.
5. Benchmark Observations: Latency, Tokens, and Cost
To evaluate providers objectively, I captured the following metrics per request:
- Total latency (milliseconds)
- Input tokens
- Output tokens
- Estimated cost per request
- Cost per 1K output tokens
- Latency per 1K output tokens
I used the prompt below for all models. The latency values shown in the table are P95 and the token usage values are the median of 10 runs with this prompt. Benchmark costs are based on the pinned pricing version described above.
> In what ways might the concept of 'free will' intersect with advancements in artificial intelligence, particularly concerning decision-making and moral responsibility?
| Provider | Model | Latency (ms) - P95 | Input Tokens (median) | Output Tokens (median) | Est. Cost / Request | Cost / 1K output tokens | Latency/ 1K output tokens (ms) |
|---|---|---|---|---|---|---|---|
| OpenAI | openai:gpt-4o-mini | 10395.56 | 33 | 661 | $0.000402 | $0.000608 | 15,725 |
| Gemini | gemini:gemini-2.5-flash | 16592.97 | 28 | 1634 | $0.004093 | $0.002506 | 10,154 |
| Anthropic | anthropic:claude-sonnet-4-5 | 16633.02 | 38 | 364 | $0.005574 | $0.015319 | 45,695 |
| Deepseek | deepseek:deepseek-reasoner | 36220.81 | 30 | 1395 | $0.000594 | $0.000426 | 25,964 |
| Groq | qwen/qwen3-32b | 6363.64 | 34 | 1921 | $0.000464 | $0.000242 | 3,312 |
| Ollama (Local) | ollama:lfm2.5-thinking | 17409.46 | 35 | 958 | Hardware Cost Only | — | 18,172 |
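For reproducibility, the P95 and median aggregates in the table can be computed from raw run samples like this. With only 10 runs per provider, a simple nearest-rank percentile is adequate:

```python
import math
from statistics import median

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile (adequate for small sample counts)."""
    ranked = sorted(samples)
    rank = math.ceil(0.95 * len(ranked))  # 1-based nearest-rank index
    return ranked[rank - 1]

# Hypothetical latencies (ms) for one provider's 10 runs:
latencies_ms = [5800.0, 5900.0, 5950.0, 6050.0, 6100.0,
                6150.0, 6200.0, 6250.0, 6300.0, 6363.64]
p95_latency = p95(latencies_ms)
median_latency = median(latencies_ms)
```

Note that with n=10, the nearest-rank P95 is simply the worst observed run, which is why more runs give a more stable tail estimate.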
5.1 Key Observations
DeepSeek's tradeoff is clear: very low per-request cost but very high latency, consistent with its positioning as a reasoning model - potentially better for offline/batch reasoning.
Groq was the best cost+speed option in these runs: lowest latency (~6.4s) while producing the longest output (1921 tokens) at a very low per-request cost ($0.000464). For real-time applications (chatbots, APIs), Groq could be an ideal choice.
At 1M requests/month, Groq is cheaper than OpenAI by cost per output token but outputs 2.9x more tokens. Claude Sonnet outputs half the tokens but costs 12x more. Provider selection should factor in output verbosity.
Claude prioritizes depth over efficiency: While it produced the shortest response (364 tokens), it has the highest per-token cost ($0.0153/k) and slowest throughput (45.7s per 1k tokens), suggesting it's best reserved for complex reasoning tasks where quality justifies the cost.
Token variance: Gemini outputs ~4.5x more tokens (1634) than Anthropic (364) for the same prompt, despite having similar latency. For cost-sensitive or token-limited applications, Anthropic may be more predictable. The framework should monitor output length per provider as a quality signal.
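The observations above reduce to simple arithmetic over the table, which is worth making explicit because per-request and per-token rankings disagree. The helper below uses the Groq and OpenAI figures from the benchmark; the 1M requests/month volume is the illustrative assumption from the text:

```python
def cost_per_1k_output(cost_per_request: float, output_tokens: int) -> float:
    """Normalize a per-request cost to cost per 1K output tokens."""
    return cost_per_request / output_tokens * 1000

def monthly_cost(cost_per_request: float, requests_per_month: int = 1_000_000) -> float:
    """Back-of-envelope monthly bill at a fixed request volume."""
    return cost_per_request * requests_per_month

# From the table: Groq $0.000464/request at 1921 output tokens,
# OpenAI $0.000402/request at 661 output tokens.
groq_per_1k = cost_per_1k_output(0.000464, 1921)      # ≈ $0.000242
openai_per_1k = cost_per_1k_output(0.000402, 661)     # ≈ $0.000608
```

Per request, OpenAI is actually slightly cheaper ($402 vs $464 per month at 1M requests); Groq only wins once you normalize by the 2.9x larger outputs it produces, which is exactly why verbosity belongs in the selection criteria.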
5.2 How to use the above data for actual decisions
- Groq (qwen3-32b) : Real-time chatbots, high volume APIs, cost sensitive workloads.
- Gemini (2.5 flash) : Balanced latency, cost and output quality. Good for general purpose tasks with flexible output length tolerance
- OpenAI (gpt-4o-mini) : Mid-cost, moderate reasoning
- Claude Sonnet : Complex reasoning tasks, code generation, high-quality outputs where token cost is acceptable
- DeepSeek (reasoner) : Batch processing, offline reasoning tasks. Avoid for latency sensitive use cases.
- Local (Ollama) : Privacy/compliance requirements and no API costs, but limited reasoning.
6. What's Next
I am working on enhancing the framework with the following capabilities -
- Intelligent routing (manual to rules-based) - Route requests to the cheapest provider that meets a latency SLA. This can yield 30-50% cost reduction at scale.
- Fallback and retry logic - Add exponential backoff and provider failover. For example, if Groq times out, automatically retry with Gemini. This improves reliability.
- Streaming and tool calling - The current scope is limited to non-streamed text only. Extend to streaming responses and structured output.
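As a preview of the rules-based routing idea, here is a minimal sketch: pick the cheapest provider whose P95 latency meets the caller's SLA. The stats table is seeded from the benchmark numbers in this article; in the real framework these would be refreshed from live measurements:

```python
from typing import Optional

# Seeded from the benchmark table; real values would come from live stats.
PROVIDER_STATS = [
    {"name": "groq",     "p95_latency_ms": 6364,  "cost_per_request": 0.000464},
    {"name": "openai",   "p95_latency_ms": 10396, "cost_per_request": 0.000402},
    {"name": "deepseek", "p95_latency_ms": 36221, "cost_per_request": 0.000594},
]

def route(max_latency_ms: float) -> Optional[str]:
    """Cheapest provider whose P95 latency meets the SLA; None if none qualify."""
    candidates = [p for p in PROVIDER_STATS if p["p95_latency_ms"] <= max_latency_ms]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p["cost_per_request"])["name"]
```

Note how the SLA changes the answer: a tight real-time budget forces Groq, while a relaxed one lets the cheaper-per-request OpenAI win.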
Hope you find this approach useful and relevant while choosing and switching between models. Let me know your feedback in the comments.