Programming Central

Posted on • Originally published at programmingcentral.hashnode.dev

Stop Burning Money on AI: Cost Tracking & Rate Limiting for Local LLMs

Running Large Language Models (LLMs) locally offers incredible privacy and control, but it’s easy to rack up costs you didn’t anticipate. Just like a cloud API bills per token, a local LLM consumes valuable resources: CPU, GPU, memory, and even electricity. Without careful management, you risk system instability, a poor user experience, and wasted hardware capacity. This post dives into the operational economics of local AI, showing you how to track costs and implement rate limiting so your LLM applications run smoothly and efficiently.

The Economics of Local AI: From Compute to Cash

We’ve all been there: a functional prototype that works beautifully… until multiple users hit it simultaneously. Integrating LLMs into your Node.js applications (using tools like Ollama and Transformers.js) is just the first step. To move to production, you need to treat inference as a finite resource with tangible costs.

Think of it like a database. A query consumes CPU cycles and I/O. An LLM request consumes computational power, memory bandwidth, and time. In the cloud, this translates directly to monetary cost per token. Locally, the cost manifests as hardware wear, electricity, and, crucially, the opportunity cost of blocking the system for other users. A runaway LLM process can easily bring a server to its knees.

Understanding the Costs: Token Usage, Latency & Hardware

Effective cost tracking requires understanding the key metrics. It’s more granular than traditional web application monitoring.

1. Token Throughput: The Core Metric

The most fundamental metric is token throughput, measured in tokens per second (TPS). Break it down:

  • Input Tokens: Tokens in the user’s prompt and any retrieved context (from RAG – Retrieval Augmented Generation).
  • Output Tokens: Tokens generated by the model.

Why it matters: generation time scales with sequence length; producing 100 output tokens takes roughly twice as long as producing 50. Because you can’t know the output length in advance, latency is unpredictable, and that demands a different mental model than standard HTTP request-response cycles.

Analogy: Imagine an LLM as a novelist. Input tokens are research notes, output tokens are written pages, and latency is the time to write. You can’t predict the book’s completion time until the final sentence. Similarly, you can’t know the exact inference cost until generation finishes.
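To make this concrete, here is a small sketch of per-request token accounting. The 4-characters-per-token estimator is a rough heuristic for English text, not a real tokenizer (use the model’s own tokenizer for exact counts):

```typescript
// Rough token accounting for a single request.
interface TokenStats {
  inputTokens: number;
  outputTokens: number;
  tokensPerSecond: number;
}

// Heuristic: English text averages ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function computeThroughput(
  prompt: string,
  completion: string,
  latencyMs: number
): TokenStats {
  const inputTokens = estimateTokens(prompt);
  const outputTokens = estimateTokens(completion);
  // TPS is usually reported as output tokens over wall-clock time,
  // since only output tokens are generated sequentially.
  const tokensPerSecond = outputTokens / (latencyMs / 1000);
  return { inputTokens, outputTokens, tokensPerSecond };
}
```

Logging these three numbers per request is the minimum viable cost tracker: over time they tell you which users and which prompt shapes consume your hardware.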

2. Hardware Overhead: VRAM & Compute

In a local deployment, “cost” is physical. The primary constraints are:

  • VRAM (Video RAM): Memory for model weights and the "KV Cache" (Key-Value Cache).
  • Compute Units: GPU core or CPU vector instruction utilization.

The KV Cache Bottleneck: LLMs remember context by storing intermediate calculations (Keys and Values) in memory. This cache grows linearly with input and output tokens. An 8GB model on a 12GB GPU has 4GB headroom. But a 50,000-token document as context could fill that 4GB, causing an Out-Of-Memory (OOM) error – a costly crash.

Analogy: Think of a restaurant kitchen. Model weights are permanent appliances. The KV Cache is counter space for preparing a dish. A simple dish needs little space; a complex banquet needs all of it. Rate limiting prevents accepting orders that overwhelm the kitchen.
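The KV Cache growth can be estimated with simple arithmetic. This sketch assumes illustrative Llama-7B-like dimensions (32 layers, hidden size 4096, fp16 cache, no grouped-query attention); check your model card for the real numbers:

```typescript
// Back-of-the-envelope KV cache sizing. Per token, the cache stores a
// Key vector and a Value vector for every layer:
// 2 * layers * hiddenSize elements.
interface ModelDims {
  layers: number;
  hiddenSize: number;   // attention heads * head dimension
  bytesPerElem: number; // 2 for fp16
}

function kvCacheBytes(dims: ModelDims, contextTokens: number): number {
  return 2 * dims.layers * dims.hiddenSize * dims.bytesPerElem * contextTokens;
}

// Illustrative dimensions, not taken from any specific checkpoint.
const llamaLike: ModelDims = { layers: 32, hiddenSize: 4096, bytesPerElem: 2 };

const gbFor50k = kvCacheBytes(llamaLike, 50_000) / 1024 ** 3;
```

At these dimensions, a 50,000-token context needs roughly 24 GB of KV cache alone, far beyond 4 GB of headroom, which is exactly why long documents crash a 12 GB card.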

Rate Limiting: The Token Bucket Algorithm

Rate limiting protects your hardware from overload. Unlike cloud APIs that throttle based on credits, local servers throttle based on capacity. The Token Bucket algorithm is superior to simple "requests per minute" counters because it handles traffic bursts while enforcing a steady-state average.

How it works:

  1. The Bucket: A container with a maximum capacity (burst size).
  2. The Tokens: Represent "permission to process."
  3. The Refill Rate: Tokens are added at a fixed rate (e.g., 10 tokens/second).
  4. The Request: Consumes tokens. If the bucket is empty, the request is rejected (or queued).

Optimizing Performance: Batching & Context Management

Maximizing value requires optimizing hardware usage.

1. Request Batching

Modern inference engines (like Ollama) support dynamic batching – grouping multiple requests into a single inference call. GPUs process matrix multiplications in parallel. Sending four requests simultaneously fills the GPU cores more efficiently, increasing aggregate throughput, even if per-request latency increases slightly.

Analogy: A city bus is more efficient than individual taxis for transporting many passengers.
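On the application side, one way to exploit this is a micro-batcher that holds requests for a few milliseconds and flushes them together. This is an illustrative sketch: `runBatch` is a placeholder for whatever batched inference call your engine exposes, and the window and batch-size values are arbitrary defaults:

```typescript
// Minimal micro-batching sketch: collect requests for a short window
// (or until maxBatch is reached), then flush them in one call.
type Resolver<T> = (value: T) => void;

class MicroBatcher<In, Out> {
  private queue: { input: In; resolve: Resolver<Out> }[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private runBatch: (inputs: In[]) => Promise<Out[]>,
    private windowMs = 20,
    private maxBatch = 4
  ) {}

  submit(input: In): Promise<Out> {
    return new Promise((resolve) => {
      this.queue.push({ input, resolve });
      if (this.queue.length >= this.maxBatch) {
        void this.flush(); // batch is full: flush immediately
      } else if (!this.timer) {
        this.timer = setTimeout(() => void this.flush(), this.windowMs);
      }
    });
  }

  private async flush(): Promise<void> {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    const pending = this.queue.splice(0);
    if (pending.length === 0) return;
    // One inference call for the whole batch; results map back by index.
    const outputs = await this.runBatch(pending.map((p) => p.input));
    pending.forEach((p, i) => p.resolve(outputs[i]));
  }
}
```

The trade-off is explicit: every request pays up to `windowMs` of extra latency in exchange for higher aggregate throughput, the same bargain the city bus makes.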

2. Context Window Management

The "Context Window" is the maximum text the model can consider (e.g., 4096 or 8192 tokens). In RAG, this is critical. Blindly pasting retrieved text can exceed the window or “drown out” the user’s question.

Optimization Strategies:

  • Re-ranking: Use a smaller model to score the relevance of retrieved chunks and keep only the top-N.
  • Summarization: Summarize long context before injection.
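Both strategies reduce to the same guard: never inject more retrieved text than the budget allows. A minimal sketch, assuming relevance scores already exist (from your retriever or a re-ranker) and using a rough 4-characters-per-token estimate in place of a real tokenizer:

```typescript
// Context-budget guard for RAG: greedily keep the highest-scored
// chunks until the estimated token budget is exhausted.
interface Chunk {
  text: string;
  score: number; // higher = more relevant
}

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function fitContext(chunks: Chunk[], budgetTokens: number): Chunk[] {
  const kept: Chunk[] = [];
  let used = 0;
  // Sort a copy by descending relevance; keep each chunk that still fits.
  for (const chunk of [...chunks].sort((a, b) => b.score - a.score)) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > budgetTokens) continue;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

In practice you would reserve part of the window for the user’s question and the model’s answer, and only hand the remainder to `fitContext`.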

Code Example: TypeScript Token Bucket Rate Limiter

This example implements a Token Bucket algorithm with cost tracking in TypeScript, suitable for a Node.js backend.

// pages/api/chat.ts
// Next.js API Route Handler

import { NextApiRequest, NextApiResponse } from 'next';

interface RateLimiterConfig {
  capacity: number;
  refillRate: number;
  costPerRequest: number;
}

interface RateLimiterState {
  tokens: number;
  lastRefill: number;
}

interface CostMetrics {
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  estimatedComputeCost: number;
}

interface ChatRequestPayload {
  query: string;
  userId?: string;
}

const CONFIG: RateLimiterConfig = {
  capacity: 10,
  refillRate: 2,
  costPerRequest: 1,
};

// In-memory state: fine for a single process; use a shared store
// (e.g. Redis) for multi-instance deployments.
const userStates = new Map<string, RateLimiterState>();

function refillBucket(state: RateLimiterState): RateLimiterState {
  const now = Date.now();
  const timePassed = (now - state.lastRefill) / 1000;
  const tokensToAdd = timePassed * CONFIG.refillRate;
  const newTokens = Math.min(CONFIG.capacity, state.tokens + tokensToAdd);
  return { ...state, tokens: newTokens, lastRefill: now };
}

function tryConsume(state: RateLimiterState): [RateLimiterState, boolean] {
  const refilledState = refillBucket(state);
  if (refilledState.tokens >= CONFIG.costPerRequest) {
    return [
      { ...refilledState, tokens: refilledState.tokens - CONFIG.costPerRequest },
      true,
    ];
  }
  return [refilledState, false];
}

async function handleRequest(req: NextApiRequest, res: NextApiResponse) {
  const { query, userId = 'anonymous' } = req.body as ChatRequestPayload;
  if (typeof query !== 'string' || query.length === 0) {
    return res.status(400).json({ message: 'Missing query' });
  }

  let currentState = userStates.get(userId) || {
    tokens: CONFIG.capacity,
    lastRefill: Date.now(),
  };

  const [newState, allowed] = tryConsume(currentState);
  userStates.set(userId, newState);

  if (!allowed) {
    return res.status(429).json({ message: 'Too many requests' });
  }

  // Simulate LLM inference (replace with Ollama/Transformers.js call)
  const startTime = Date.now();
  const promptTokens = Math.ceil(query.length / 4); // Rough estimate (~4 chars/token)
  const completionTokens = 50; // Example
  const latencyMs = Math.random() * 400 + 100;
  await new Promise((resolve) => setTimeout(resolve, latencyMs));

  const costMetrics: CostMetrics = {
    promptTokens,
    completionTokens,
    latencyMs,
    estimatedComputeCost: 0.01, // Example cost
  };

  res.status(200).json({ response: `Processed: ${query}`, costMetrics });
}

export default handleRequest;

Conclusion: Sustainable Local AI

Cost tracking and rate limiting aren’t just about preventing crashes; they’re about building sustainable local AI applications. By understanding the economics of inference and implementing these techniques, you can maximize the value of your hardware, deliver a consistent user experience, and avoid unexpected costs. Remember to adapt the configuration and metrics to your specific hardware and application needs. Don't just run your LLM – manage it.

The concepts and code demonstrated here are drawn from the book The Edge of AI: Local LLMs (Ollama), Transformers.js, WebGPU, and Performance Optimization, part of the AI with JavaScript & TypeScript series, available on Amazon.
The ebook is also on Leanpub: https://leanpub.com/EdgeOfAIJavaScriptTypeScript.

👉 Get free access to the TypeScript & AI Series on Programming Central: 8 volumes, 160 chapters, and hundreds of quizzes covering every chapter.
