Productionizing Ollama: Rate Limits, Cloud Fallback, and Cost Guardrails
Running Ollama locally is easy. Running it in a production service that handles concurrent users without melting your box — that's a different problem.
I wrote up the basic Ollama + NeuroLink setup in Running Local LLMs with NeuroLink and Ollama: Complete Guide. This article is the follow-up: what happens after you ship it and it gets real traffic.
Three things break first: request queues pile up under concurrency, latency spikes on heavier models, and you have no budget guardrails because "it's free" turns out not to mean "it can't cause you problems." Here's how to solve all three.
The Problem: Ollama Has No Native Rate Limiting
OpenAI returns a 429 when you hit its rate limit. Ollama doesn't have a rate limit — it queues requests and processes them serially on whatever GPU you have. Five concurrent requests to llama3.1:70b on a single machine means the fifth request waits for the first four to finish.
In practice, your p99 latency goes from 4 seconds to 20 seconds and your users give up.
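You can see this for yourself without any SDK in the loop. Here's a quick probe against Ollama's HTTP API; it assumes a default local install on localhost:11434 with llama3.1 pulled, and Node 18+ for global fetch:
// Fire five concurrent requests at the local Ollama instance and time each one.
// With the serial queue described above, request N waits for requests 1..N-1,
// so the latencies stack roughly linearly.
const PROMPT = "Explain backpressure in one sentence.";
await Promise.all(
  Array.from({ length: 5 }, async (_, i) => {
    const start = Date.now();
    await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: "llama3.1", prompt: PROMPT, stream: false }),
    });
    console.log(`request ${i + 1}: ${((Date.now() - start) / 1000).toFixed(1)}s`);
  })
);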
You need to impose your own rate limiting at the SDK layer before requests reach the Ollama process.
Pattern 1: Request Throttling via Middleware
NeuroLink's middleware system runs as a pipeline on every generate() call. A throttling middleware can reject or queue requests before they're dispatched to the provider:
import { NeuroLink } from "@juspay/neurolink";
// Simple token-bucket rate limiter
class TokenBucket {
private tokens: number;
private lastRefill = Date.now();
constructor(
private readonly capacity: number,
private readonly refillRatePerSecond: number
) {
this.tokens = capacity;
}
consume(): boolean {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(
this.capacity,
this.tokens + elapsed * this.refillRatePerSecond
);
this.lastRefill = now;
if (this.tokens >= 1) {
this.tokens -= 1;
return true;
}
return false;
}
}
const bucket = new TokenBucket(10, 2); // 10 burst, 2 req/sec sustained
const throttleMiddleware = {
name: "ollama-throttle",
priority: 120, // Runs before everything else
transformParams: async (params: any) => {
if (!bucket.consume()) {
throw new Error("LOCAL_RATE_LIMIT: Ollama request queue full");
}
return params;
},
};
const ai = new NeuroLink({
provider: "ollama",
model: "llama3.1",
middleware: [throttleMiddleware],
});
The middleware throws a LOCAL_RATE_LIMIT error before the request reaches Ollama. Your calling code catches this and routes elsewhere — which brings us to the next pattern.
Pattern 2: Falling Back to Cloud When Local is Overloaded
This is the multi-provider fallback pattern from Building Resilient AI: Multi-Provider Fallback Patterns in TypeScript applied specifically to the Ollama overload scenario.
NeuroLink's fallbackChain handles provider-level failures automatically, but the throttle middleware above throws before the provider is even called. You need to catch that specific error and escalate.
Here's the full pattern:
import { NeuroLink } from "@juspay/neurolink";
// Primary: local Ollama with throttle
const localAI = new NeuroLink({
provider: "ollama",
model: "llama3.1",
middleware: [throttleMiddleware],
});
// Fallback: cloud providers in priority order
const cloudAI = new NeuroLink({
providers: [
{ name: "anthropic", model: "claude-3-5-haiku-20241022", priority: 1 },
{ name: "openai", model: "gpt-4o-mini", priority: 2 },
],
fallbackChain: ["anthropic", "openai"],
});
async function generate(prompt: string) {
try {
return await localAI.generate({ input: { text: prompt } });
} catch (err: any) {
if (err.message?.startsWith("LOCAL_RATE_LIMIT")) {
// Ollama queue full — route to cloud
console.warn("Ollama saturated, routing to cloud");
return await cloudAI.generate({ input: { text: prompt } });
}
throw err; // Re-throw unexpected errors
}
}
const result = await generate("Summarize this support ticket...");
console.log(`Provider used: ${result.provider}`);
The critical thing here: you want Haiku or GPT-4o-mini as your cloud fallback, not Claude Sonnet or GPT-4o. The fallback scenario is "Ollama is busy" — you're handling overflow, not upgrading quality. Match the capability tier, not the price tier.
Pattern 3: Latency Budgets — Switching on Timeout
Queue saturation isn't the only signal that Ollama is struggling. A 70B model under thermal throttling might accept the request but take 30 seconds to answer. You need a latency budget.
NeuroLink's generate() accepts a timeout option (number ms or string like "8s") plus an abortSignal, and the FallbackConfig chain triggers on errors — including timeout errors. Combine both for a clean latency-budget pattern:
import { NeuroLink } from "@juspay/neurolink";
const ai = new NeuroLink({
providers: [
{
name: "ollama",
model: "llama3.1",
priority: 1,
},
{
name: "anthropic",
model: "claude-3-5-haiku-20241022",
priority: 2,
apiKey: process.env.ANTHROPIC_API_KEY,
},
],
fallbackConfig: {
enabled: true,
maxAttempts: 2, // ollama, then anthropic
circuitBreaker: true,
},
});
const result = await ai.generate({
input: { text: prompt },
timeout: 8000, // 8s budget for the call; throws → fallback chain takes over
});
// Log which provider actually served this request
if (result.provider !== "ollama") {
console.warn(`Latency budget exceeded, fell back to ${result.provider}`);
metrics.increment("ollama.latency_fallback");
}
Set your timeout conservatively. An 8-second budget for an interactive request is already too slow for chat. If you're building a real-time interface, consider 3-4 seconds and accept that heavy models will frequently fall back. Batch processing can afford 15-30 seconds.
The timeout option applies to the whole generate() call. For a strict per-provider deadline (e.g., "give Ollama exactly 3 seconds before racing Claude"), wrap each provider's call in a Promise.race with your own AbortController — the SDK doesn't expose a per-provider timeout field directly.
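For completeness, here's one way that wrapper can look. This is a sketch rather than a NeuroLink feature: it reuses the localAI and cloudAI instances from Pattern 2, assumes generate() honors the abortSignal option mentioned above, and the 3-second figure is just an example.
// Hard per-provider deadline: give Ollama a fixed window, then escalate regardless.
async function generateWithDeadline(prompt: string, localDeadlineMs = 3000) {
  const controller = new AbortController();
  let deadlineTimer: ReturnType<typeof setTimeout> | undefined;

  // Rejects, and aborts the in-flight local request, once the deadline passes.
  const deadline = new Promise<never>((_, reject) => {
    deadlineTimer = setTimeout(() => {
      controller.abort();
      reject(new Error("LOCAL_DEADLINE"));
    }, localDeadlineMs);
  });

  try {
    return await Promise.race([
      localAI.generate({ input: { text: prompt }, abortSignal: controller.signal }),
      deadline,
    ]);
  } catch {
    // Deadline hit or local failure: go straight to the cloud instance.
    return await cloudAI.generate({ input: { text: prompt } });
  } finally {
    clearTimeout(deadlineTimer);
  }
}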
Pattern 4: Cost Guardrails with the onFinish Hook
"Ollama is free" is true for the LLM calls themselves. It's not true for:
- Cloud fallback calls (every Anthropic/OpenAI request costs money)
- Your compute bill if you're running Ollama on cloud GPU instances
- The engineering time debugging a service that's silently spending money
The onFinish lifecycle hook fires after every successful generation with usage data and provider info. Use it to track where your spend is going:
import { NeuroLink } from "@juspay/neurolink";
// Per-1K token pricing (cloud fallback providers)
const CLOUD_PRICING: Record<string, { input: number; output: number }> = {
"claude-3-5-haiku-20241022": { input: 0.0008, output: 0.004 },
"gpt-4o-mini": { input: 0.00015, output: 0.0006 },
};
let sessionCost = 0;
const BUDGET_ALERT_USD = 5.0; // Alert when session spend hits $5
const ai = new NeuroLink({
providers: [
{ name: "ollama", model: "llama3.1", priority: 1 },
{
name: "anthropic",
model: "claude-3-5-haiku-20241022",
priority: 2,
apiKey: process.env.ANTHROPIC_API_KEY,
},
],
fallback: true,
fallbackConfig: { timeoutMs: 8000, retryAttempts: 1 },
middleware: [
{
name: "cost-guard",
onFinish: (result, metadata) => {
// Ollama cost is effectively zero, but the hook still fires
const pricing = CLOUD_PRICING[metadata.model] ?? { input: 0, output: 0 };
const callCost =
((result.usage?.promptTokens ?? 0) / 1000) * pricing.input +
((result.usage?.completionTokens ?? 0) / 1000) * pricing.output;
sessionCost += callCost;
// Always log provider — visibility into fallback frequency is useful
console.log(
`[cost-guard] provider=${metadata.provider} ` +
`model=${metadata.model} ` +
`tokens=${result.usage?.totalTokens ?? 0} ` +
`cost=$${callCost.toFixed(6)} ` +
`session_total=$${sessionCost.toFixed(4)}`
);
if (metadata.provider !== "ollama") {
metrics.increment("ollama.fallback_call", {
provider: metadata.provider,
});
}
if (sessionCost > BUDGET_ALERT_USD) {
notifyOps(`Cloud fallback cost alert: $${sessionCost.toFixed(2)} this session`);
}
},
},
],
});
Even when Ollama handles the request, the log line fires, so over time it gives you your fallback rate. If 30% of requests are hitting cloud fallback, your Ollama instance is undersized for your traffic.
Putting It Together: A Production-Ready Ollama Service
Here's the complete pattern for a service that handles realistic traffic:
import { NeuroLink } from "@juspay/neurolink";
const CLOUD_PRICING = {
"claude-3-5-haiku-20241022": { input: 0.0008, output: 0.004 },
"gpt-4o-mini": { input: 0.00015, output: 0.0006 },
};
const bucket = new TokenBucket(10, 2);
export const ai = new NeuroLink({
providers: [
{ name: "ollama", model: "llama3.1", priority: 1 },
{
name: "anthropic",
model: "claude-3-5-haiku-20241022",
priority: 2,
apiKey: process.env.ANTHROPIC_API_KEY,
},
],
fallback: true,
fallbackConfig: {
timeoutMs: 8000,
retryAttempts: 1,
},
middleware: [
{
name: "throttle",
priority: 120,
transformParams: async (params: any) => {
if (!bucket.consume()) {
throw new Error("LOCAL_RATE_LIMIT");
}
return params;
},
},
{
name: "cost-guard",
onFinish: (result, metadata) => {
const pricing = (CLOUD_PRICING as any)[metadata.model] ?? { input: 0, output: 0 };
const cost =
((result.usage?.promptTokens ?? 0) / 1000) * pricing.input +
((result.usage?.completionTokens ?? 0) / 1000) * pricing.output;
recordMetrics({
provider: metadata.provider,
model: metadata.model,
tokens: result.usage?.totalTokens ?? 0,
cost,
duration: metadata.duration,
wasLocal: metadata.provider === "ollama",
});
},
onError: (error, metadata) => {
logger.error("generation_failed", {
provider: metadata.provider,
error: error.message,
recoverable: metadata.recoverable,
});
},
},
],
});
export async function generateWithFallback(prompt: string) {
try {
return await ai.generate({ input: { text: prompt } });
} catch (err: any) {
if (err.message?.startsWith("LOCAL_RATE_LIMIT")) {
// Explicit queue-full path: skip Ollama entirely, go straight to cloud
return await new NeuroLink({
providers: [
{
name: "anthropic",
model: "claude-3-5-haiku-20241022",
apiKey: process.env.ANTHROPIC_API_KEY,
},
],
}).generate({ input: { text: prompt } });
}
throw err;
}
}
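For a sense of how this slots into a service, here's a rough sketch of an HTTP handler around generateWithFallback. Express, the /summarize route, and the ./ai import path are illustrative; only generateWithFallback comes from the setup above.
import express from "express";
import { generateWithFallback } from "./ai"; // wherever the setup above lives

const app = express();
app.use(express.json());

app.post("/summarize", async (req, res) => {
  try {
    // generateWithFallback already handles the local-vs-cloud decision.
    const result = await generateWithFallback(req.body.text);
    res.setHeader("x-llm-provider", result.provider); // handy when debugging fallback behavior
    res.json(result);
  } catch {
    // Both the local and cloud paths failed; let clients retry with backoff.
    res.status(503).json({ error: "generation_unavailable" });
  }
});

app.listen(3000);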
What to Watch in Production
A few metrics worth tracking (a minimal tracker sketch follows the list):
- ollama.fallback_rate: What percentage of requests don't complete on Ollama. Over 10% means your instance is undersized.
- ollama.p95_latency: If your 70B model's p95 goes above your timeout threshold, you need a smaller model or more hardware.
- cloud_fallback.cost_per_hour: Your actual cloud spend from overflow requests. This is your real Ollama infrastructure cost.
- token_bucket.rejection_rate: How often you're hitting the local rate limit before even trying Ollama. A spike here usually means a burst of traffic, not a hardware problem.
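Here's a minimal sketch of tracking the first two of these in-process. recordRequest would be called from the onFinish hook above; in a real deployment these numbers would go to your metrics backend rather than live in memory.
// Tracks fallback rate and Ollama p95 latency for the current process.
const ollamaLatenciesMs: number[] = [];
let totalRequests = 0;
let servedLocally = 0;

export function recordRequest(provider: string, durationMs: number) {
  totalRequests += 1;
  if (provider === "ollama") {
    servedLocally += 1;
    ollamaLatenciesMs.push(durationMs);
  }
}

export function metricsSnapshot() {
  const sorted = [...ollamaLatenciesMs].sort((a, b) => a - b);
  return {
    fallbackRate: totalRequests === 0 ? 0 : (totalRequests - servedLocally) / totalRequests,
    ollamaP95Ms: sorted[Math.floor(sorted.length * 0.95)] ?? 0,
    totalRequests,
  };
}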
The Ollama guide covers what to run. This setup covers what to watch after you run it.
Get started with NeuroLink:
- GitHub: https://github.com/juspay/neurolink
- npm: npm install @juspay/neurolink
- Docs: https://blog.neurolink.ink/docs
- Blog: https://blog.neurolink.ink