RileyKim

How I Built a Production AI Tutor Without Vendor Lock-in

Last quarter my cofounder showed me the OpenAI bill and I nearly choked on my cold brew. We were running a tutoring platform that did adaptive explanations, step-by-step math walkthroughs, and conversational Socratic dialogue for about 12,000 active learners. Our monthly GPT-4o spend was north of what I was willing to admit in a blog post. The product was working. The unit economics were not.

So I did what any responsible (read: paranoid) early-stage CTO does. I tore the abstraction layer off our LLM calls, stared at the routing logic, and asked one question: can we hit the same quality bar for half the spend, while keeping the door open to swap providers in an afternoon? That little refactor became the most important architecture decision we made all year, and it is the story I want to tell here.

This is the playbook I wish someone had handed me before I wrote my first production prompt: the models I evaluated, the real numbers behind the benchmarks, the code that actually shipped, and the handful of production-ready patterns that kept our burn rate sane while our quality went up.

The Wake-Up Call: Why GPT-4o Stopped Making Sense

I want to be honest about something. GPT-4o is a phenomenal model. At $2.50 per million input tokens and $10.00 per million output tokens with a 128K context window, it is the kind of default that engineers reach for without thinking. And that is exactly the problem.

When you are prototyping, defaults are great. When you are serving tens of millions of tokens a day to paying customers, defaults become a tax. The math was unforgiving. Our tutoring sessions averaged around 2,400 output tokens per conversation (explanations are verbose when you actually teach instead of just answering), and our input was about 1,800 tokens once you factored in system prompts, prior turns, and the learner's working memory. Multiply that across 12,000 daily actives and the per-student cost was eating our margin alive.

The deeper issue, though, was architectural. A direct, hard-coded dependency on one provider is vendor lock-in in its purest form. If OpenAI raises prices, we eat it. If they have an outage in our region, we are dark. If a competing model wins on quality for our specific workload, we cannot pivot without a rewrite. None of that is acceptable at scale.

I needed a unified interface, a menu of models, and the ability to A/B test in production. That is what pushed me to standardize everything on Global API, which routes to 184 models behind a single OpenAI-compatible endpoint. One SDK swap, and the entire model layer became a config change rather than a refactor.

The Models I Actually Tested

I spent two weeks running the same eval suite against every model that looked remotely interesting. Below is the short list that survived. I am keeping the exact pricing and context windows as published, because that is the whole point of this exercise: knowing the real numbers.

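| Model | Input ($/M tokens) | Output ($/M tokens) | Context window |
| --- | --- | --- | --- |
| DeepSeek V4 Flash | 0.27 | 1.10 | 128K |
| DeepSeek V4 Pro | 0.55 | 2.20 | 200K |
| Qwen3-32B | 0.30 | 1.20 | 32K |
| GLM-4 Plus | 0.20 | 0.80 | 128K |
| GPT-4o | 2.50 | 10.00 | 128K |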

The pricing on the Global API catalog ranges from $0.01 to $3.50 per million tokens across the full set of 184 models, which gives you a sense of the headroom you have once you stop defaulting to the most expensive option in the room.

DeepSeek V4 Flash quickly became my workhorse. At $0.27 input and $1.10 output with a 128K window, it handled roughly 70% of our traffic without breaking a sweat. The step-by-step math explanations, the conceptual Socratic prompts, the misconception diagnosis, all of it. DeepSeek V4 Pro I reserve for the long-context cases where a learner uploads a 60-page chapter and wants a guided reading. The 200K window and the slightly higher quality on dense reasoning earn the $2.20 output premium there.

Qwen3-32B I kept around for multilingual sessions, and GLM-4 Plus turned out to be a delightful surprise at $0.20 input and $0.80 output, perfect for the cheap-and-cheerful first-pass answers where we would have historically over-spent on a flagship model.
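
To make the price gap concrete, here is a small sketch that turns the table into a per-conversation cost, using the average token counts quoted earlier (about 1,800 input and 2,400 output tokens). The function name and the dictionary keys are illustrative, and the calculation ignores caching, retries, and any volume discounts.

```python
# Prices in USD per million tokens (input, output), copied from the table above.
PRICES = {
    "DeepSeek V4 Flash": (0.27, 1.10),
    "DeepSeek V4 Pro": (0.55, 2.20),
    "Qwen3-32B": (0.30, 1.20),
    "GLM-4 Plus": (0.20, 0.80),
    "GPT-4o": (2.50, 10.00),
}

# Average token counts per conversation quoted earlier in the post.
INPUT_TOKENS = 1_800
OUTPUT_TOKENS = 2_400


def cost_per_conversation(
    model: str,
    input_tokens: int = INPUT_TOKENS,
    output_tokens: int = OUTPUT_TOKENS,
) -> float:
    """Return the estimated USD cost of one conversation on the given model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    for name in PRICES:
        print(f"{name:<20} ${cost_per_conversation(name):.6f}")
```

With these inputs, GPT-4o comes to about $0.0285 per conversation and DeepSeek V4 Flash to about $0.0031, roughly a ninefold difference before any routing or caching.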

The Code That Actually Shipped

Here is the production-ready client setup. I went with the OpenAI Python SDK because every engineer on my team already knew it, and because the drop-in compatibility meant zero retraining.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

def generate_tutor_response(
    learner_message: str,
    history: list[dict],
    difficulty: str = "intermediate",
) -> str:
    # State the difficulty level explicitly so the model can adapt to it.
    system_prompt = (
        "You are a patient tutor. Never give the final answer. "
        "Ask guiding questions, surface misconceptions, and adapt "
        f"to the learner's stated difficulty level: {difficulty}."
    )
    # Order matters: system prompt first, then prior turns, then the new message.
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)
    messages.append({"role": "user", "content": learner_message})

    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Flash",
        messages=messages,
        temperature=0.4,
        max_tokens=900,
    )
    # content can be None, so fall back to an empty string to honor the return type.
    return response.choices[0].message.content or ""

That is the happy path. The real production win, though, is the router. I built a small dispatcher that picks the right model based on intent classification, and that is where the ROI shows up. Something like this:

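def route_tutor_request(user_message: str, history: list[dict]) -> str:
    """Return the model ID to use for this request."""
    text = user_message.lower()
    # Rough token estimate (about 4 characters per token) for prior turns plus the new message.
    total_tokens = sum(len(m.get("content") or "") // 4 for m in history) + len(user_message) // 4

    # Explanation requests and long pasted problems go to the stronger model.
    if "explain" in text or len(user_message) > 1500:
        return "deepseek-ai/DeepSeek-V4-Pro"

    # Short definitional questions go to the cheapest model.
    if any(kw in text for kw in ["quick", "define", "what is"]):
        return "glm-4-plus"

    # Very long conversations need the larger context window.
    if total_tokens > 80_000:
        return "deepseek-ai/DeepSeek-V4-Pro"

    # Everything else goes to the default workhorse.
    return "deepseek-ai/DeepSeek-V4-Flash"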

This is the part that unlocked the savings. Same SDK, same base URL, completely different model per request. When a learner asks "what is a prime number," they get a fast, cheap answer from GLM-4 Plus. When they paste a 4,000-token physics problem, they get DeepSeek V4 Pro. The expensive GPT-4o is now reserved for the 3% of queries where we have empirically seen quality gaps, and even those are A/B tested weekly.

The Production Patterns That Saved Our Burn Rate

Here are the things that actually moved the needle at scale. None of them are glamorous. All of them are why we are still in business.

Aggressive caching. Roughly 40% of our incoming questions are repeats of common topics, especially at the start of a school term. A simple semantic cache in front of the LLM call, with a generous TTL, cut a huge slice of our spend overnight. If you are not caching at the prompt level, you are leaving ROI on the table.
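
As a rough illustration of the idea (not the production implementation), a semantic cache can be sketched as a list of embedded questions with a similarity threshold and a TTL. Everything here is an assumption for the sake of the example: the `SemanticCache` class, the 0.92 threshold, the one-day TTL, and the `embed` callable, which stands in for whatever embedding model you use. A linear scan like this only suits small caches; a real deployment would likely use a vector index.

```python
import math
import time
from typing import Callable, Optional


class SemanticCache:
    """Tiny in-memory semantic cache: linear scan, cosine similarity, TTL expiry."""

    def __init__(
        self,
        embed: Callable[[str], list[float]],  # hypothetical embedding function
        threshold: float = 0.92,              # assumed similarity cutoff
        ttl_seconds: float = 86_400,          # assumed one-day TTL
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed = embed
        self._threshold = threshold
        self._ttl = ttl_seconds
        self._clock = clock
        # Each entry is (question vector, cached answer, expiry time).
        self._entries: list[tuple[list[float], str, float]] = []

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, question: str) -> Optional[str]:
        """Return a cached answer for a sufficiently similar question, else None."""
        now = self._clock()
        # Drop expired entries before searching.
        self._entries = [e for e in self._entries if e[2] > now]
        if not self._entries:
            return None
        vector = self._embed(question)
        score, answer = max(
            ((self._cosine(vector, v), a) for v, a, _ in self._entries),
            key=lambda pair: pair[0],
        )
        return answer if score >= self._threshold else None

    def put(self, question: str, answer: str) -> None:
        """Store an answer under the question's embedding."""
        self._entries.append((self._embed(question), answer, self._clock() + self._ttl))
```

The call site would check `cache.get(question)` before calling the model and call `cache.put(question, answer)` afterwards. In a tutoring product the threshold deserves care, since two questions can be worded alike while the learners asking them need different hints.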

Streaming responses. I will never go back. Streaming does not change the token cost, but it changes the perceived latency dramatically, and learner patience in a tutoring context is finite. Our time to first visible output dropped from 1.2 seconds, when learners waited for the complete non-streamed response, to about 280 milliseconds with streaming. Same model, same cost, vastly better UX.
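
For reference, streaming with the OpenAI Python SDK comes down to passing `stream=True` and iterating over the returned chunks. The sketch below assumes the endpoint honors `stream=True` the way the OpenAI API does; the helper name `stream_tutor_response` is made up for this example, and the client is passed in so the function works with any OpenAI-compatible client object.

```python
from typing import Any, Iterator


def stream_tutor_response(client: Any, model: str, messages: list[dict]) -> Iterator[str]:
    """Yield text fragments as they arrive instead of waiting for the full reply."""
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.4,
        max_tokens=900,
        stream=True,  # ask the server to send incremental chunks
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            yield text
```

A web handler would forward each yielded fragment to the browser as it arrives (for example over server-sent events), which is where the drop in perceived latency comes from.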

Smart tiering. For the easy 50% of queries, we route to a GA-Economy tier and watch the cost line item fall by half. The trick is being honest about which queries are actually easy. We use a tiny classifier for that, not vibes.

Quality monitoring. Every response gets scored on a few dimensions: did it follow the Socratic system prompt, did it avoid giving the answer, did it match the learner's level. We sample 2% of traffic and have a human review queue. This is how we caught a regression in our long-context model swap before it hit customers.
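
One way to pick that 2% sample is to hash a stable response ID, so the decision is reproducible and does not depend on which server handled the request. This is a sketch under assumptions: the function name `in_review_sample` and the use of SHA-256 are illustrative choices, not a description of the pipeline above.

```python
import hashlib


def in_review_sample(response_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of responses for human review."""
    digest = hashlib.sha256(response_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the outcome depends only on the ID, the same response is either always in the review queue or never in it, which makes audits and backfills simpler than with random sampling.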

Graceful fallback. Rate limits happen. Vendor outages happen. The router has a try/except chain that fails over to the next-cheapest model in under 800 milliseconds. The learner never sees an error, and the cost of redundancy is essentially zero because we are using the same unified endpoint.
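
The fallback chain can be sketched as a loop over an ordered list of model IDs. This is a minimal illustration, not the router itself: `complete_with_fallback` and `call_model` are hypothetical names, and `call_model` is assumed to wrap the actual API call and enforce its own timeout.

```python
from typing import Callable, Optional, Sequence


def complete_with_fallback(
    call_model: Callable[[str], str],  # hypothetical wrapper around the API call
    models: Sequence[str],             # model IDs in order of preference
) -> str:
    """Try each model in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return call_model(model)
        except Exception as exc:
            # Broad catch for the sketch; in production, narrow this to
            # rate-limit, timeout, and connection errors.
            last_error = exc
    raise RuntimeError("all models in the fallback chain failed") from last_error
```

A caller might pass something like `["deepseek-ai/DeepSeek-V4-Flash", "glm-4-plus"]` as `models`, with `call_model` built from the client shown earlier.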

The Numbers I Actually Trust

I am allergic to benchmarks that look pretty in slides and fall apart in production. Here is what we measured end-to-end on real learner traffic over a 30-day window.

The cost reduction landed between 40% and 65% depending on the cohort, with the heavier savings on the queries we successfully tiered. Our p50 latency sits at 1.2 seconds end-to-end, and throughput averages 320 tokens per second per model. The aggregate benchmark score across the eval suite is 84.6%, which is within a hair of what GPT-4o was delivering for us, at a fraction of the spend.

Those are the numbers that show up on the finance dashboard, and they are the numbers that justified the entire refactor. If you are a CTO staring at a ballooning LLM line item, this is the shape of the move that pays off.

What I Would Do Differently

I waited too long. I told myself for six months that switching off GPT-4o was a "next quarter" problem, and every month my burn rate climbed. The actual migration took my team about two weeks once we committed, and the unified SDK pattern meant most of that time was testing, not rewriting. The setup from a cold start, had I started with Global API on day one, would have been under 10 minutes. The interface is OpenAI-compatible, the auth is a single env var, and the model catalog is browsable in the dashboard.

The other thing I would change is my evaluation discipline. I shipped the router before I had a proper offline eval harness, and we paid for that with a week of noisy A/B results. If you take one piece of advice from this post, build the eval first, then ship the model swap. Your future self will thank you.
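
To show what "build the eval first" can look like at its smallest, here is a sketch of an offline harness that runs fixed cases through any generate function and reports a pass rate per check. The case format, the two string-based checks, and all names are assumptions for illustration; a real harness would use richer scoring than substring matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    forbidden: list[str]  # strings that would reveal the final answer


def avoids_final_answer(response: str, case: EvalCase) -> bool:
    """Pass if none of the forbidden answer strings appear in the response."""
    lowered = response.lower()
    return not any(term.lower() in lowered for term in case.forbidden)


def asks_guiding_question(response: str, case: EvalCase) -> bool:
    """Crude proxy for Socratic style: the response contains a question."""
    return "?" in response


CHECKS: dict[str, Callable[[str, EvalCase], bool]] = {
    "avoids_final_answer": avoids_final_answer,
    "asks_guiding_question": asks_guiding_question,
}


def run_eval(generate: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Return the fraction of cases that pass each check."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = {name: 0 for name in CHECKS}
    for case in cases:
        response = generate(case.prompt)
        for name, check in CHECKS.items():
            passed[name] += int(check(response, case))
    return {name: count / len(cases) for name, count in passed.items()}
```

Running the same cases against each candidate model before a swap gives a baseline to compare against, so the A/B test confirms a result instead of discovering it.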

The Architecture Decision That Ties It All Together

If there is one thing I want you to walk away with, it is this. At scale, the LLM you pick is not really a model decision. It is an architecture decision. The model is a config value. The provider is a config value. The router, the cache, the fallback, the eval suite, those are the things that compound over time.

I sleep better now because I can swap any model in our stack in a single PR, route around regional outages automatically, and prove to my investors that our unit economics are getting better with every model release. That is the real ROI, and that is the position I want every startup CTO reading this to be in.

If you are wrestling with your own LLM bill or staring down a vendor lock-in problem, Global API is worth a look. One key, 184 models, OpenAI-compatible SDK, pricing that actually scales with your traffic rather than punishing it. They offer 100 free credits to get started, which is more than enough to run your own eval suite and see the numbers for yourself. I am not on their payroll, I am just a CTO who finally stopped hemorrhaging margin on defaults.
