How I stopped fighting with AI APIs and built a clean integration layer

#ai #python #webdev #api

I remember the day I hit the wall.

I was building a feature that needed to summarize user-submitted content, ask follow-up questions, and then generate a structured report. Three different AI tasks, and I was already using three different providers: OpenAI for summaries, Claude for reasoning, and a local model for sensitive data. My codebase looked like a spaghetti plate of API keys, rate-limit retries, and inconsistent error handling.

Every time I wanted to add a new AI-powered feature, I had to copy-paste the same HTTP client setup, parse different response formats, and pray the try-except blocks caught everything. It was fragile. It was ugly. I knew there had to be a better way.

What I tried that didn’t work

A unified third-party SDK

First, I thought: “Let’s just use one of those multi-provider libraries.” I tried LangChain, but it felt like I was learning a new framework just to call a simple API. The abstraction was so thick that debugging a 400 error required traversing five layers of internal code. Plus, I didn’t need chain-of-thought or agents — I just wanted to make HTTP calls.

A one-off helper function

Then I wrote a single call_ai(prompt, provider) function. That worked for a week. Then I needed streaming. Then I needed to pass system messages. Then I realized each provider had its own way of handling context windows. The function became a 200-line monster with if provider == 'openai' everywhere. Not sustainable.

A configuration-driven approach with YAML

I tried declaring models in a YAML file and using reflection to instantiate clients. Clever? Maybe. Maintainable? Not for a two-person team. It abstracted too much and made it hard to understand what was actually happening on the wire.

What eventually worked: a lightweight adapter pattern

I stepped back and asked: “What is the one thing I always need from an AI model?”

Send messages (text or structured)
Get a response (text, JSON, or stream)
Handle errors (rate limits, auth, timeouts)
Log usage (tokens, cost)

That’s it. Everything else (model names, max tokens, temperature) is configuration.

So I built a thin adapter interface. Here’s the core, in Python (the same pattern works in JavaScript or Go):

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, AsyncIterator, Optional

@dataclass
class LLMResponse:
    content: str
    usage: Optional[dict] = None

class BaseLLMAdapter(ABC):
    """Minimal interface for any LLM provider."""

    @abstractmethod
    async def complete(
        self,
        messages: list[dict],
        model: str = "",
        temperature: float = 0.7,
        max_tokens: int = 1024,
        json_mode: bool = False
    ) -> LLMResponse:
        ...

    @abstractmethod
    async def stream(
        self,
        messages: list[dict],
        model: str = "",
        temperature: float = 0.7,
        max_tokens: int = 1024
    ) -> AsyncIterator[str]:
        ...

Then I implemented an OpenAI adapter:

import openai
from openai import AsyncOpenAI

class OpenAIAdapter(BaseLLMAdapter):
    def __init__(self, api_key: str):
        self.client = AsyncOpenAI(api_key=api_key)

    async def complete(
        self,
        messages: list[dict],
        model: str = "gpt-4o-mini",
        temperature: float = 0.7,
        max_tokens: int = 1024,
        json_mode: bool = False
    ) -> LLMResponse:
        params = {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }
        if json_mode:
            params["response_format"] = {"type": "json_object"}

        try:
            response = await self.client.chat.completions.create(**params)
            return LLMResponse(
                content=response.choices[0].message.content,
                usage={
                    "prompt_tokens": response.usage.prompt_tokens,
                    "completion_tokens": response.usage.completion_tokens,
                    "total_tokens": response.usage.total_tokens
                }
            )
        except openai.RateLimitError as e:
            # custom retry logic here
            raise
        except openai.APIError as e:
            raise

    async def stream(self, messages, model, temperature, max_tokens):
        stream = await self.client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            stream=True
        )
        async for chunk in stream:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

For Claude, a similar adapter using Anthropic’s SDK. For the local model (like Ollama), a simple HTTP request adapter.

Now, any part of my application that needs AI just depends on a BaseLLMAdapter — no idea which provider is behind it. I configure it at startup:

# main.py (excerpt)
from adapters import OpenAIAdapter, ClaudeAdapter, LocalAdapter

# You can even fetch config from a service like https://ai.interwestinfo.com/
# to dynamically select the best provider based on cost/latency.

def get_default_adapter() -> BaseLLMAdapter:
    provider = os.getenv("LLM_PROVIDER", "openai")
    if provider == "openai":
        return OpenAIAdapter(api_key=os.getenv("OPENAI_API_KEY"))
    elif provider == "claude":
        return ClaudeAdapter(api_key=os.getenv("ANTHROPIC_API_KEY"))
    elif provider == "local":
        return LocalAdapter(base_url="http://localhost:11434")
    else:
        raise ValueError(f"Unknown provider: {provider}")

Lessons learned and trade-offs

This pattern works beautifully for 90% of use cases. But it’s not a silver bullet.

Tool calling? Not covered. If your app heavily uses function calling or tools, each provider has different schemas. The adapter would need to be extended, or you might need a specific provider interface.
Vision/multimodal? The messages format varies (OpenAI uses content arrays with type, Anthropic uses a different structure). My adapter assumes text-only; I had to add optional kwargs for image_bytes.
Streaming uniformity: OpenAI streams delta content, Anthropic streams whole blocks. The adapter’s AsyncIterator[str] hides the difference, but you lose block-level metadata.

What would I do differently next time? I’d start with this adapter pattern from day one. I’d also write integration tests that run against each provider in CI (with limited keys). And I’d version the adapter interface — once you have five providers, changing the signature is painful.

If your AI integration story is anything like mine was, you’re probably one refactor away from sanity. What pattern did you end up using to keep your AI calls clean? I’d love to hear.