Simangaliso Vilakazi

Posted on Jun 27

How I Replaced Gemini with a Self-Hosted LLM for Two Production Apps

#ai #selfhosted #nextjs #ollama

A while back I wrote about my terminal-inspired portfolio and the products it indexes. Two of those products lean on a language model: the portfolio terminal at smngvlkz.com that you can ask questions, and PayChasers, which generates optional payment follow-up emails. Both started on Google's Gemini 3 Flash. Both now run on a model I host myself, with a fallback chain that keeps them alive when my hardware is not.

This is the story of that move: the experiment that started it, why I committed to it, what the architecture looks like, the night it broke, and the parts I still have not solved.

It started as an experiment

When Qwen 3.5 was announced, it made me curious about how far open models have actually come. Instead of reading benchmarks, I tested it the way I like to learn things, by running it.

It began as a small experiment on my base Mac mini. I pulled Qwen through Ollama just to see how capable the model would be running directly on a local machine. The results were far better than I expected. Good enough that I stopped thinking of it as a toy and started thinking about production.

Why move off Gemini at all

Gemini 3 Flash worked. The integration was a few lines and the quality was good. So this was not a "the API is bad" story. It was three smaller pulls that added up.

The first was cost shape. PayChasers generates optional email drafts on demand, and every preview is a few thousand tokens of system prompt plus output. That is fine at zero users and a slow leak at volume. The marginal cost of an inference I run on a machine I already own is electricity.

The second was control and privacy. I wanted to choose the model, pin it, and change the prompt contract without a provider deprecating something underneath me. I also did not love sending client names and payment context to a third party when I did not have to.

The third was the economics of treating AI as infrastructure rather than a metered API. Once the model runs on hardware I control, it stops being a per-call expense and becomes shared infrastructure that multiple applications can use. The same inference server now powers two different products. That reframing is the whole point.

Getting it to production was the hard part

The original plan was to host the model on Oracle Cloud using one of their free Ampere ARM instances in the Johannesburg region. If you have ever tried to get one, you know the struggle. Free tier ARM capacity is brutally limited, and after more than 200 automated retry attempts across two days, I still could not get one.

So I pivoted. I wrote a lightweight reverse proxy, set up a Cloudflare Tunnel on one of my domains, and routed production traffic to the model running on my Mac at home. No ports opened on my home network, no static IP, just a tunnel from Cloudflare's edge to the machine on my desk.

There is an honest tension here worth naming. Privacy was one of my reasons for leaving Gemini, and routing client data to a box on my desk is its own tradeoff. The tunnel keeps inbound ports closed and lets Cloudflare terminate TLS at the edge, and the reverse proxy sits in front of Ollama rather than exposing it directly. But this is a single machine I own, not a vendor's hardened, multi-tenant platform. Tightening access control on that endpoint, with a service token rather than an obscure hostname, is firmly on the list. Self-hosting moves the privacy boundary onto you; it does not remove it.

It was meant to be temporary. The Oracle instance eventually did come through, but by then the home setup was working well, so I did not throw it away. Instead I kept the Mac mini as the primary and gave Oracle a different job, the always-on backup. More on that in a moment.

This was a small full-circle moment. The Linux and infrastructure fundamentals I picked up during my bootcamp days and years of self-teaching showed up in a real production context. Provisioning tunnels, configuring DNS, writing a proxy service, setting up persistent services. All of it coming together for something real.

One deliberate decision was to keep the infrastructure simple. There are a lot of frameworks and agent systems appearing in the space right now. I focused on straightforward tooling that solved the problems I actually had.

The shape of the system

The Mac mini, exposed through a Cloudflare Tunnel, is the primary. It is fast but it is not always on, because it is a machine in my home. The Oracle Cloud VM is the fallback. It is slower and smaller, but it stays up around the clock.

Every app talks to a thin client that knows about both, tries the fast one first, and silently falls back to the reliable one.

Vercel app
   |
   v
[ primary: Mac mini via Cloudflare tunnel ]  --fail/timeout-->  [ fallback: Oracle Cloud VM ]
        fast, not always on                                          slow, always on

The failover client

This is the whole idea in one function. Hit the primary with a timeout. If anything goes wrong, whether a non-2xx status, a timeout, or a dropped tunnel, fall through to the fallback.

const PRIMARY_URL = process.env.OLLAMA_PRIMARY_URL || "http://localhost:11434";
const FALLBACK_URL = process.env.OLLAMA_FALLBACK_URL || PRIMARY_URL;

async function fetchWithFallback(path: string, body: object): Promise<Response> {
    try {
        const res = await fetch(`${PRIMARY_URL}${path}`, {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify(body),
            signal: AbortSignal.timeout(15000),
        });
        if (!res.ok) throw new Error(`Primary failed (${res.status})`);
        return res;
    } catch {
        const res = await fetch(`${FALLBACK_URL}${path}`, {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify(body),
        });
        if (!res.ok) {
            const text = await res.text().catch(() => "Unknown error");
            throw new Error(`Ollama request failed (${res.status}): ${text}`);
        }
        return res;
    }
}

A few small choices that matter more than they look:

The primary gets a 15 second timeout, the fallback does not. The thinking was that the fallback's job is to answer at all, so I let it take its time. In practice that means an unbounded fetch, which can hang if Oracle is reachable but wedged. A long timeout would be the more defensible version of the same idea, and I have not added one yet.
The catch swallows why the primary failed, no log, no signal. Fine for failing over, bad for diagnosing, and something I would tighten before I called this production hardened.
The fallback URL defaults to the primary, so the same code runs locally with one Ollama instance and no special config.
Failover is transparent. The caller never knows which machine answered.
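
The two gaps named above, the unbounded fallback and the swallowed primary error, could be closed along these lines. This is a minimal sketch, not the code running in production: FALLBACK_TIMEOUT_MS and its 120 second value, the postJson helper, and the injectable fetchFn parameter (there so the function can be exercised without a network) are all assumptions made for illustration.

```typescript
// Sketch only. FALLBACK_TIMEOUT_MS, postJson, and fetchFn are hypothetical names.
type FetchFn = (url: string, init?: RequestInit) => Promise<Response>;

const PRIMARY_TIMEOUT_MS = 15_000;   // same 15 second budget as the original
const FALLBACK_TIMEOUT_MS = 120_000; // assumed value: long, but no longer unbounded

// POST a JSON body with a timeout; throw on any non-2xx status.
async function postJson(
    fetchFn: FetchFn,
    url: string,
    body: object,
    timeoutMs: number,
): Promise<Response> {
    const res = await fetchFn(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(body),
        signal: AbortSignal.timeout(timeoutMs),
    });
    if (!res.ok) {
        const text = await res.text().catch(() => "Unknown error");
        throw new Error(`Request to ${url} failed (${res.status}): ${text}`);
    }
    return res;
}

async function fetchWithBoundedFallback(
    primaryUrl: string,
    fallbackUrl: string,
    path: string,
    body: object,
    fetchFn: FetchFn = fetch,
): Promise<Response> {
    try {
        return await postJson(fetchFn, `${primaryUrl}${path}`, body, PRIMARY_TIMEOUT_MS);
    } catch (err) {
        // Record why the primary failed before failing over.
        console.warn("Primary failed, trying fallback:", err instanceof Error ? err.message : err);
        return postJson(fetchFn, `${fallbackUrl}${path}`, body, FALLBACK_TIMEOUT_MS);
    }
}
```

The caller still never learns which machine answered; the only visible difference is a log line on failover and an eventual error instead of a hung request.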

Load-aware model selection

Running your own models means you also get to decide which model serves which request. I do a very simple version of routing based on how many requests are in flight.

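// Model names as described in the text below.
const PRIMARY_MODEL = "qwen3.5:latest";
const FALLBACK_MODEL = "qwen2.5-coder:7b";

let activeRequests = 0;

function selectModel(): string {
    // 1 request: best quality. 2+: lighter model that handles concurrency.
    return activeRequests > 1 ? FALLBACK_MODEL : PRIMARY_MODEL;
}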

The intent is that a single visitor gets qwen3.5:latest, the better model, and the moment requests overlap, new ones drop to qwen2.5-coder:7b, which is lighter under concurrency. It is one counter and a ternary: the cost and quality tradeoff in miniature.

I will be honest about how well it actually works, because it is more idea than guarantee. activeRequests is a module-level counter, so on serverless it only sees concurrency inside a single warm instance, not across the fleet. Worse, in the streaming path it is decremented in a finally that runs when the function returns the Response, which is before the stream has finished generating. So for the streaming features, which is most of what these apps do, the counter is near zero almost all the time and the downgrade rarely fires. It works on the non-streaming path, where the count wraps the full call. Right now it is more a hook I reached for early than a load balancer that earns the name.
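
One way to make the counter reflect streaming work is to release it when the stream ends rather than when the handler returns. The sketch below assumes a hypothetical trackStream wrapper around the response body; it is not part of the original code, and it does nothing about the serverless limitation, since the count is still local to one warm instance.

```typescript
let activeRequests = 0;

// Hypothetical helper: wraps a streaming body so the counter drops only
// when the stream finishes, errors, or is cancelled by the client.
function trackStream(body: ReadableStream<Uint8Array>): ReadableStream<Uint8Array> {
    activeRequests++;
    let released = false;
    const release = () => {
        if (!released) {
            released = true;
            activeRequests--;
        }
    };
    const reader = body.getReader();
    return new ReadableStream<Uint8Array>({
        async pull(controller) {
            try {
                const result = await reader.read();
                if (result.done) {
                    release();
                    controller.close();
                    return;
                }
                controller.enqueue(result.value);
            } catch (err) {
                release();
                controller.error(err);
            }
        },
        async cancel(reason) {
            release();
            await reader.cancel(reason);
        },
    });
}
```

A route would return new Response(trackStream(upstream.body)) in place of incrementing and decrementing around the handler, so a request counts as in flight for as long as tokens are still being generated.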

I also pass two Ollama options that earn their keep:

keep_alive: -1 keeps the model resident in memory so the next request does not pay the cold load.
think: false turns off the reasoning tokens, because for a portfolio terminal and an email draft I want the answer, not the monologue.
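
For context, this is roughly where those two fields sit in a chat request. The sketch assumes they are sent at the top level of the /api/chat body, which is where Ollama's API documents them; how the original code assembles its request is not shown, and buildChatRequest is a hypothetical helper name.

```typescript
interface ChatMessage {
    role: "system" | "user" | "assistant";
    content: string;
}

// Hypothetical helper: assembles the JSON body for Ollama's /api/chat.
function buildChatRequest(model: string, messages: ChatMessage[]) {
    return {
        model,
        messages,
        stream: true,
        keep_alive: -1, // negative value: keep the model loaded indefinitely
        think: false,   // skip reasoning output on models that support thinking
    };
}
```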

Not everything should hit the model

The cheapest inference is the one you never run. Previously my portfolio terminal used Gemini 3 Flash for natural language queries while common commands were handled locally without AI. I kept that split when I moved the natural language layer onto my own infrastructure.

const lowerQuery = query.toLowerCase().trim();

if (lowerQuery === "help") { /* return static command list */ }
if (lowerQuery === "list all") { /* return products + systems from data */ }
if (lowerQuery === "show activity") { /* return GitHub/GitLab stats */ }

const showMatch = lowerQuery.match(/^show\s+([\w-]+)\s+(\w+)$/);
if (showMatch) { /* answer straight from structured data */ }

// only open-ended natural language falls through to the model

help, list, show, and explain are answered straight from the typed data. Only genuinely open-ended questions stream from the model. It is faster, it is free, and it is more reliable than asking a 7B model to format a list it could get wrong.

Streaming the answer

For the open-ended path, the portfolio streams tokens over server-sent events. Ollama returns newline-delimited JSON, so the route reads the body, splits on newlines, and re-emits each token as an SSE frame.

const reader = response.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() || "";
    for (const line of lines) {
        if (!line.trim()) continue;
        const chunk = JSON.parse(line);
        const token = chunk.message?.content || "";
        if (token) {
            controller.enqueue(encoder.encode(
                `data: ${JSON.stringify({ type: "token", content: token })}\n\n`
            ));
        }
    }
}

Both products stream responses token by token and run entirely on infrastructure I control.

Constraining the model so the output is usable

PayChasers is where the prompt work actually lives, because the output is not a chat bubble, it is an email that gets sent to someone's client. Two things make a self-hosted 7B model reliable enough for that.

First, the model never writes real values. It writes placeholders, and the app fills them in. This keeps the model from hallucinating an amount or a name.

CRITICAL: You MUST use these exact placeholder variables instead of real values:
- {clientName} for the recipient's name
- {dueDate} for the due date
- {amount} for the amount owed
- {daysOverdue} for the number of days overdue

For example: "Hey {clientName}," NOT "Hey John,".
Return ONLY valid JSON.
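
The filling step on the app side can be very small. The following is an illustrative sketch rather than the PayChasers implementation: the ChaseValues type and fillPlaceholders function are hypothetical names, and the only thing taken from the prompt above is the four placeholder variables.

```typescript
// Hypothetical shape for the real values the app already knows.
type ChaseValues = {
    clientName: string;
    dueDate: string;
    amount: string;
    daysOverdue: number;
};

// Replace each known {placeholder} with its real value.
// Unknown placeholders are left untouched so they are easy to spot.
function fillPlaceholders(template: string, values: ChaseValues): string {
    return template.replace(/\{(\w+)\}/g, (match, key: string) =>
        Object.prototype.hasOwnProperty.call(values, key)
            ? String(values[key as keyof ChaseValues])
            : match,
    );
}
```

Because the values come from the database and not from the model, a wrong amount can only come from the data, never from generation.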

Second, the tone escalates with how late the payment is, decided in code, not left to the model's mood.

function determineTone(daysOverdue: number) {
    if (daysOverdue >= 14) return "urgent";
    if (daysOverdue >= 7) return "firm";
    return "friendly";
}

And because a local model will occasionally wrap its JSON in a code fence or stray <think> block no matter how firmly you ask, the parser is defensive rather than trusting.

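function extractJson(text: string) {
    // Drop any <think>...</think> block before parsing.
    const cleaned = text.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
    // 1. The happy path: the whole response is JSON.
    try { return JSON.parse(cleaned); } catch {}

    // 2. JSON wrapped in a ``` or ```json fence.
    const fence = cleaned.match(/```(?:json)?\s*([\s\S]*?)```/);
    if (fence) { try { return JSON.parse(fence[1].trim()); } catch {} }

    // 3. Last resort: parse from the first "{" to the last "}".
    const first = cleaned.indexOf("{");
    const last = cleaned.lastIndexOf("}");
    if (first !== -1 && last > first) {
        try { return JSON.parse(cleaned.slice(first, last + 1)); } catch {}
    }
    // Nothing parsed: the caller receives undefined.
    return undefined;
}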

Self-hosting a smaller model means you trade some of the provider's polish for parsing your own. That is a fair trade when the upside is control and cost.

The night the power went out

Then I learned the lesson that every self-hoster learns eventually.

There was a small power outage one night around 20:00. The Mac mini, my primary inference node, switched off, and it never came back on. I only realised the next morning.

PayChasers failed over to the Oracle backup automatically, exactly as it should have. But the floating terminal in my portfolio had no failover, so it just sat there dead all night. Anyone who was bored enough to try and poke at my portfolio that night got nothing.

Two lessons came out of that morning:

Every service that needs inference needs a failover. Not just the ones I remembered to set it up for. The portfolio terminal got the same fetchWithFallback client that PayChasers already had.
A 12-hour outage I did not even notice is a monitoring problem, not just me being forgetful. Mostly. Partly forgetful.

Self-hosting your own AI is great until you are the one on call at 8am on a Saturday, and there is no one else to escalate to, because it is your own thing.

Knowing when the homelab is down

So I built the monitoring I should have had first. PayChasers runs a small cron that health-checks both Ollama endpoints and emails me, but only on state transition, up to down or down to up. It keeps the last known state in Upstash Redis so it does not spam me every 5 minutes while the mini is asleep.

const ENDPOINTS = [
    { name: "primary-mac", url: process.env.OLLAMA_PRIMARY_URL },
    { name: "fallback-oracle", url: process.env.OLLAMA_FALLBACK_URL },
];

// hit /api/tags on each, compare to stored state in Redis,
// send a Resend email only when ok flips. Auth via CRON_SECRET.

Now when the mini goes offline, traffic quietly shifts to Oracle and I get exactly one email telling me so. That is the entire operations story, and that is the amount of operations story I want for a side project.
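
The notify-on-transition logic described above can be sketched as follows. Everything here besides the /api/tags probe is an assumption: the StateStore interface stands in for Upstash Redis, notify stands in for the Resend email, the 5 second probe timeout and the health: key prefix are made up, and the CRON_SECRET check is left out.

```typescript
type Endpoint = { name: string; url: string };
// Stand-in for Redis: any async key-value store with get and set works.
type StateStore = {
    get(key: string): Promise<string | null>;
    set(key: string, value: string): Promise<void>;
};
type Notify = (message: string) => Promise<void>;
type Probe = (url: string) => Promise<boolean>;

// An endpoint counts as healthy if /api/tags answers with a 2xx in time.
async function isHealthy(url: string): Promise<boolean> {
    try {
        const res = await fetch(`${url}/api/tags`, { signal: AbortSignal.timeout(5000) });
        return res.ok;
    } catch {
        return false;
    }
}

async function checkEndpoints(
    endpoints: Endpoint[],
    store: StateStore,
    notify: Notify,
    probe: Probe = isHealthy,
): Promise<void> {
    for (const ep of endpoints) {
        const now = (await probe(ep.url)) ? "up" : "down";
        const key = `health:${ep.name}`;
        const before = await store.get(key);
        if (before === now) continue; // no transition, no email
        await store.set(key, now);
        // On the very first run there is no stored state; only report a failure.
        if (before !== null || now === "down") {
            await notify(`${ep.name} is ${now}`);
        }
    }
}
```

Running this every 5 minutes while the mini stays down produces one message on the first failed check and nothing more until it comes back.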

What I have not solved

I want to be honest about the edges, because the architecture above is the easy part.

My evaluation is still vibes. I read the generated emails, decide they look good, and ship. I do not have an eval harness scoring tone, placeholder correctness, or JSON validity across a fixed set of cases. I should. When I claim qwen3.5 is "better" than qwen2.5-coder for a request, that is intuition, not a benchmark.

The irony is that the plumbing is already there. PayChasers runs PostHog for the product funnel, signups, chases created, upgrades. Capturing AI events would be trivial. A draft_generated, draft_accepted, draft_edited, draft_regenerated funnel would tell me, with real users, how often a generated email ships untouched versus gets rewritten. That acceptance rate is a real quality signal, and it is the cheapest first step from vibes towards measurement. I just have not wired it yet.
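
The offline half of that measurement, JSON validity and placeholder correctness over a fixed set of saved outputs, could start as small as the sketch below. It is an assumption-heavy starting point and not an existing harness: checkDraft and passRate are hypothetical names, the draft is assumed to be a flat JSON object whose string fields hold the email text, and tone is not scored at all.

```typescript
// The four placeholders the prompt allows.
const ALLOWED_PLACEHOLDERS = new Set(["clientName", "dueDate", "amount", "daysOverdue"]);

type DraftCheck = { validJson: boolean; unknownPlaceholders: string[] };

// Check one raw model output: does it parse as a JSON object, and does
// it use only the allowed {placeholder} names?
function checkDraft(rawOutput: string): DraftCheck {
    let parsed: unknown;
    try {
        parsed = JSON.parse(rawOutput);
    } catch {
        return { validJson: false, unknownPlaceholders: [] };
    }
    if (typeof parsed !== "object" || parsed === null) {
        return { validJson: false, unknownPlaceholders: [] };
    }
    // Assumption: the email text lives in top-level string fields.
    const text = Object.values(parsed)
        .filter((v): v is string => typeof v === "string")
        .join("\n");
    const found = text.match(/\{\w+\}/g) || [];
    const unknownPlaceholders = found
        .map((token) => token.slice(1, -1))
        .filter((name) => !ALLOWED_PLACEHOLDERS.has(name));
    return { validJson: true, unknownPlaceholders };
}

// Fraction of outputs that parse and use only allowed placeholders.
function passRate(outputs: string[]): number {
    if (outputs.length === 0) return 0;
    const passed = outputs.filter((o) => {
        const check = checkDraft(o);
        return check.validJson && check.unknownPlaceholders.length === 0;
    }).length;
    return passed / outputs.length;
}
```

Run against the same saved prompts for each model, the pass rate gives a first number to compare, which is more than reading a handful of emails can.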

My model selection is instinct, not measurement. I picked these Qwen models because they ran well on my hardware and read well in practice. A systematic version would measure latency, quality, and cost per model and route on data.

And I have not touched retrieval. Both apps stuff their full context into the system prompt, which is fine at this size and would fall apart the moment the data outgrew the window. There is no RAG here, and I have not yet had to reach for it.

I am pointing at these on purpose. The move off Gemini taught me serving, the cost and reliability tradeoff, basic routing, and prompt constraining by doing them. The next layer, real evaluation and measured model choice, is the part I am learning now.

What it adds up to

Open models have come a long way. It is becoming genuinely practical to run useful AI systems on relatively small infrastructure. No GPU cluster required. What started as a small experiment on a base mini is now live for real users across two products, on infrastructure I own.

This is not a finished system. It is a snapshot of how I run a model I control today, and a map of what I am building next.

DEV Community