Wanda

Posted on • Originally published at apidog.com
Running AI models locally vs. via API: which should you choose?

TL;DR

Local AI runs on your hardware, has no per-request fees, and keeps data private. API-based AI is faster to start, more capable, and scales without infrastructure. Most teams need both. This guide covers when each approach wins, with concrete numbers.

Introduction

Gemma 4 running natively on an iPhone. A browser extension that embeds a full language model without an API key. Neither was possible 18 months ago. Today both are shipping, and showing up on the HackerNews front page.

The decision used to be simple: frontier models are API-only, everything else is too weak to matter. That's changed. Local models like Qwen2.5-72B, Gemma 4, and DeepSeek-V3 now compete on real benchmarks. Developers who previously defaulted to OpenAI's API are reconsidering, especially for privacy-sensitive applications or high-volume tasks where per-token costs compound fast.

This article cuts through the marketing. You'll get concrete numbers on cost, latency, and capability so you can make the right call for your use case.

💡 If you're testing AI API integrations regardless of whether the model is local or cloud, Apidog's Test Scenarios work with both. You can point them at a local llama-server endpoint or at OpenAI's /v1/chat/completions and run the same assertions. More on that later. See [internal: api-testing-tutorial] for the baseline testing approach.

What "running AI locally" actually means

Local AI can mean three setups:

  • On-device inference: The model runs entirely on the device, with no server. (E.g., Gemma in a browser tab, Gemma 4 on the iPhone Neural Engine, an Ollama model on a MacBook.) No internet needed after download.
  • Self-hosted server: Run a model on your own hardware (workstation, cloud VM you control, or on-premises server) and expose an API. Not on the user's device, but not at OpenAI either. Tools: llama-server, Ollama, vLLM.
  • Private cloud: Deploy on your own cloud infrastructure (AWS Bedrock custom models, Azure private endpoints, GCP Vertex AI custom models). More control vs. public API, less hassle than fully self-hosted.

This article compares self-hosted vs. public API, since that's the choice most developers face.

Cost comparison

Local AI is the clear winner for high-volume workloads.

Public API pricing (April 2026):

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Gemini 1.5 Pro | $1.25 | $5.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3 Haiku | $0.25 | $1.25 |

Self-hosted cost estimate (Qwen2.5-72B on A100 80GB):

  • A100 80GB (Lambda Labs): ~$1.99/hour on-demand.
  • Qwen2.5-72B at INT4 quantization fits on one A100, serves ~200 tokens/sec.
  • At 200 tokens/sec, that's 720K tokens/hour, or ~$0.0028 per 1K tokens total.
  • For comparison, GPT-4o charges $0.01 per 1K tokens output alone.

Break-even point: a dedicated A100 running 24/7 at $1.99/hour costs ~$47.76/day, which buys ~4.8M output tokens at GPT-4o's $10/1M rate. Above that volume, self-hosting wins outright. Below it, the API is cheaper unless you can shut the GPU down when idle, since an idle GPU still bills by the hour.

For lighter models: a 4-bit quantized Gemma 4 (12B) runs on a single RTX 4090. At a $0.40/hour equivalent cloud GPU running 24/7 ($9.60/day), the break-even against GPT-4o mini's $0.60/1M output price is ~16M output tokens/day. For light workloads, the cheap API tiers are hard to beat unless the hardware is already paid for.
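The break-even arithmetic is easy to sanity-check yourself. A minimal sketch, using the illustrative GPU rates and API prices quoted in this article rather than live pricing:

```python
def break_even_tokens_per_day(gpu_cost_per_hour: float,
                              api_price_per_1m_output: float,
                              gpu_hours_per_day: float = 24.0) -> float:
    """Daily output-token volume at which dedicated-GPU spend equals API spend."""
    daily_gpu_cost = gpu_cost_per_hour * gpu_hours_per_day
    return daily_gpu_cost / api_price_per_1m_output * 1_000_000

# A100 at $1.99/hr, 24/7, vs. GPT-4o output at $10/1M tokens
a100_vs_gpt4o = break_even_tokens_per_day(1.99, 10.00)
print(f"{a100_vs_gpt4o:,.0f} output tokens/day")  # 4,776,000 output tokens/day
```

Drop `gpu_hours_per_day` if you can spin the GPU up only while serving: per-token, a saturated GPU is always cheaper than the API at these rates, and the whole question becomes one of utilization.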

Latency comparison

Latency depends on setup.

  • Time to first token (TTFT): On a dedicated A100, TTFT for a 1K-token prompt with a 72B model is ~800ms–1.5s. OpenAI API typically returns first token in 300-800ms for similar input.
  • On-device inference (iPhone/Apple Silicon): TTFT for Gemma 4 is 200-400ms (no network overhead).
  • Throughput: Single A100 with a 72B model at INT4 serves one user well, but throughput drops under concurrency unless you batch. Public APIs handle concurrency automatically.
  • Streaming: Both local and API support streaming. On-device: no network jitter. API: network conditions apply.

Summary:

  • On-device = lowest latency (no network)
  • Self-hosted = high throughput (with batching, e.g. vLLM)
  • Public API = best for burst capacity and easy scaling

Capability comparison

Public APIs still lead for most demanding tasks.

  • Reasoning/complex tasks: GPT-4o, Claude 3.5 Sonnet remain ahead of open-weight models on MMLU, HumanEval, and multi-step reasoning. Qwen2.5-72B and DeepSeek-V3 have narrowed the gap, but it's still there.
  • Code generation: DeepSeek-Coder-V2 and Qwen2.5-Coder-32B match GPT-4o on many code benchmarks. For code tasks, use a specialized code model self-hosted.
  • Context length: API models support 128K–1M tokens. Self-hosted models usually top out at 32K–128K, since longer contexts need more GPU memory for the KV cache.
  • Multimodal: GPT-4o and Gemini 1.5 Pro handle image/audio/video. Open-weight multimodal (LLaVA, Qwen-VL) exist but lag.
  • Function calling/tool use: OpenAI/Anthropic have reliable tool-use. Open-weight model tool use works but is less consistent. See [internal: how-ai-agent-memory-works] for impact on agents.

Privacy and data control

Local wins by default.

Public API:

  • Prompts leave your network.
  • Provider's data retention applies (OpenAI keeps inputs 30 days by default).
  • Provider's terms of service apply (sensitive content restrictions).
  • In regulated industries, may be a compliance blocker.

Self-hosted:

  • Prompts stay on your infra.
  • No third-party retention.
  • Full control over data handling.
  • Easier compliance (GDPR/HIPAA).

If you're handling health data, legal docs, or proprietary code, self-hosted may be required.

How to test AI integrations regardless of where the model runs

You can hit https://api.openai.com/v1/chat/completions, http://localhost:11434/api/chat (Ollama), or http://localhost:8080/v1/chat/completions (llama-server) — all are OpenAI-compatible. This lets Apidog Test Scenarios run against any HTTP endpoint.

Example Test Scenario:

{
  "scenario": "Chat completion smoke test",
  "environments": {
    "local": {"base_url": "http://localhost:11434"},
    "production": {"base_url": "https://api.openai.com"}
  },
  "steps": [
    {
      "name": "Basic completion",
      "method": "POST",
      "url": "{{base_url}}/v1/chat/completions",
      "body": {
        "model": "{{model_name}}",
        "messages": [{"role": "user", "content": "Say 'test passed' and nothing else"}],
        "max_tokens": 20
      },
      "assertions": [
        {"field": "status", "operator": "equals", "value": 200},
        {"field": "response.choices[0].message.content", "operator": "contains", "value": "test passed"},
        {"field": "response.usage.total_tokens", "operator": "less_than", "value": 50}
      ]
    }
  ]
}

Run this scenario against your local Ollama instance during development and against the OpenAI API in CI. If your code works locally, it should work with the API. If not, it's usually due to:

  • Model name format (qwen2.5:72b for Ollama, gpt-4o for OpenAI)
  • Function calling response structure (provider differences)
  • Streaming event format (data vs. delta vs. full response)

Apidog's Smart Mock simulates local-model behavior in CI so you don't need a GPU online. Configure a mock that returns OpenAI-compatible responses and test your scenarios. See [internal: how-to-build-tiny-llm-from-scratch] for response structure background.
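If you'd rather script the same checks outside Apidog, the scenario's assertions translate to a few lines of plain Python. This sketch runs them against a hand-written mock body in the chat-completion shape both providers return (the mock's values are made up for illustration):

```python
def check_completion(resp: dict) -> list[str]:
    """Run the smoke-test assertions against a parsed chat-completion body.

    Returns a list of failure messages; an empty list means the test passed.
    """
    failures = []
    content = resp["choices"][0]["message"]["content"]
    if "test passed" not in content:
        failures.append(f"unexpected content: {content!r}")
    if resp["usage"]["total_tokens"] >= 50:
        failures.append(f"token usage too high: {resp['usage']['total_tokens']}")
    return failures

# Mock response in the shared OpenAI/Ollama shape (values are illustrative)
mock = {
    "choices": [{"message": {"role": "assistant", "content": "test passed"}}],
    "usage": {"prompt_tokens": 14, "completion_tokens": 3, "total_tokens": 17},
}
print(check_completion(mock))  # []
```

Because both backends return the same shape, one checker covers local and cloud runs alike.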

Setting up a local model server in 10 minutes

To try self-hosted without commitment, use Ollama:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (Gemma 4 12B, fits in 10GB VRAM)
ollama pull gemma4:12b

# Start the server (OpenAI-compatible API on port 11434)
ollama serve

# Test it
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:12b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

For production self-hosting with concurrency, use vLLM:

pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 32768

This exposes an OpenAI-compatible API on port 8000. Point Apidog at http://your-server:8000 and run your Test Scenarios.

When to choose each approach

| Scenario | Local | API |
|---|---|---|
| High-volume batch (>100K tokens/day) | Cheaper | Expensive |
| Privacy-sensitive data | Required | Risky |
| Lowest latency on-device | Best | Not possible |
| Frontier model capability needed | Insufficient | Required |
| Burst workloads, variable traffic | Complex to scale | Handles automatically |
| No GPU available | Hard | Easy |
| Dev/test environment | Great (Ollama) | Costs money |
| Multimodal tasks | Limited | Full support |
| Regulated industry compliance | Easier | Requires DPA |
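The table is mechanical enough to encode as a first-pass decision helper. A rough sketch, using the table's rules and its 100K-token threshold, simplified; real decisions deserve more nuance than five booleans:

```python
def recommend_backend(tokens_per_day: int,
                      privacy_sensitive: bool = False,
                      needs_frontier_capability: bool = False,
                      needs_multimodal: bool = False,
                      has_gpu: bool = True) -> str:
    """First-pass local-vs-API recommendation, ordered by hardest constraint."""
    if privacy_sensitive:
        return "local"   # data can't leave your infrastructure
    if needs_frontier_capability or needs_multimodal:
        return "api"     # open-weight models still lag here
    if not has_gpu:
        return "api"
    if tokens_per_day > 100_000:
        return "local"   # high-volume batch favors self-hosting
    return "api"         # low volume: no idle GPU to pay for

print(recommend_backend(500_000))                         # local
print(recommend_backend(500_000, needs_multimodal=True))  # api
```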

Practical advice:

Use a public API for production (Claude or GPT-4o for quality, Haiku or 4o-mini for high-volume/cheap tasks), and Ollama locally for development/testing. This gives you frontier quality in prod, zero cost in dev, and a consistent API surface.

See [internal: open-source-coding-assistants-2026] for open source coding assistants in the local AI stack.

Conclusion

The local vs. API decision is not binary. It depends on your volume, privacy, latency, and capability needs.

For most AI-powered apps:

  • Start with a public API.
  • Move to self-hosted when your monthly bill >$200-300.
  • Use Ollama locally from day one.
  • Keep your code provider-agnostic by using the OpenAI-compatible API surface everywhere.

Test both environments with Apidog to catch subtle differences before they hit production.

FAQ

What's the minimum GPU for a useful local model?

RTX 3060 (12GB VRAM) runs Qwen2.5-7B or Gemma 4 4B at full quality. RTX 4090 (24GB VRAM) handles most 14B–20B models at INT4 and 34B models at INT2. For 72B models, you need 2x 24GB GPUs or a single A100/H100.

Can I run local AI on Apple Silicon?

Yes. Ollama has native Apple Silicon support and runs models on the GPU via Metal. An M3 Pro (18GB unified memory) runs Qwen2.5-14B comfortably. An M4 Max (128GB) handles 70B models.

Is local model output good enough for production?

Depends. For code generation, summarization, and structured data extraction: yes, with a 32B+ model. For complex reasoning or nuanced writing: frontier API models still have an edge.

Do local models support function calling?

Yes, but not as reliably. Llama 3.1, Qwen2.5, and Mistral support tool use, but reliability is lower than GPT-4o/Claude 3.5 Sonnet. Test thoroughly with Apidog Test Scenarios before relying on local model tool use in production. See [internal: claude-code] for details.

How much to self-host a 70B model on AWS?

p4d.24xlarge (8x A100 40GB): $32.77/hour on-demand. Runs a 70B INT8 model with high throughput. g5.2xlarge (1x A10G 24GB) at $1.21/hour runs a 14B INT4 model. Reserved instances are 30–40% cheaper.

What's the difference between Ollama and llama.cpp?

llama.cpp is the inference engine. Ollama wraps llama.cpp with REST API, model management, and a CLI. Use Ollama for dev. Use llama.cpp directly (via llama-server) for more control.

Can I switch between local and API models without changing code?

Yes, if you use an OpenAI-compatible client. In Python:

import openai

client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

Connects to Ollama. Change base_url to https://api.openai.com/v1 and update api_key to switch to cloud. Set via environment variables and your code stays the same.
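One way to wire that environment-variable switch is a small config helper. A sketch, where the variable names `OPENAI_BASE_URL` / `OPENAI_API_KEY` / `OPENAI_MODEL` and the Ollama defaults are conventions assumed here, not a standard:

```python
import os

def client_config() -> dict:
    """Resolve base_url/api_key/model from env vars, defaulting to local Ollama."""
    return {
        "base_url": os.environ.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.environ.get("OPENAI_API_KEY", "ollama"),  # Ollama ignores the key
        "model": os.environ.get("OPENAI_MODEL", "gemma4:12b"),
    }

cfg = client_config()
# client = openai.OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
```

In CI, export the OpenAI values; on a dev laptop, export nothing and the defaults point at Ollama.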
