Running AI models locally vs. via API: which should you choose?

TL;DR

Local AI runs on your hardware, costs nothing per request, and keeps data private. API-based AI is faster to start, more capable, and scales without infrastructure. Most teams need both. This guide compares cost, latency, capability, privacy, and testing workflows so you can choose the right setup.


Introduction

Gemma 4 running natively on an iPhone. A browser extension that embeds a full language model without an API key. These were not practical for most developers 18 months ago. Today, local AI is becoming a real deployment option.

The old default was simple: use a frontier API model, because local models were too weak to matter. That has changed. Local models like Qwen2.5-72B, Gemma 4, and DeepSeek-V3 now compete on many real benchmarks. Developers who previously defaulted to OpenAI-style APIs are reconsidering, especially for privacy-sensitive applications or high-volume workloads where token costs compound quickly.

This guide focuses on implementation tradeoffs: cost, latency, capability, privacy, and how to test AI integrations consistently whether the model runs locally or in the cloud.

If you are testing AI API integrations, Apidog Test Scenarios work with both local and cloud models. You can point the same scenario at a local llama-server endpoint or at OpenAI's /v1/chat/completions endpoint and run the same assertions. See [internal: api-testing-tutorial] for the baseline testing approach.

What "running AI locally" means

Local AI is not one deployment model. There are three common setups.

1. On-device inference

The model runs entirely on the user device, with no server involved.

Examples:

  • Gemma running in a browser tab
  • Gemma 4 on an iPhone Neural Engine
  • An Ollama model running on a MacBook

After the model is downloaded, internet access is not required.

2. Self-hosted server

You run the model on hardware you control and expose an API.

That hardware might be:

  • A workstation
  • A cloud VM
  • An on-prem server
  • A dedicated GPU box

Common tools:

  • Ollama
  • llama-server
  • vLLM

The model is not running on the end user's device, but it is also not running at OpenAI, Anthropic, or Google.

3. Private cloud

You deploy a model on cloud infrastructure you control.

Examples:

  • AWS Bedrock custom models
  • Azure private endpoints
  • GCP Vertex AI custom models

This gives you more control than a public API and less operational burden than fully self-hosting.

This article focuses mostly on self-hosted vs. public API, because that is the decision most developers face.

Cost comparison

Local AI usually wins on cost for high-volume workloads.

Public API pricing, as of April 2026:

Model                Input (per 1M tokens)   Output (per 1M tokens)
GPT-4o               $2.50                   $10.00
Claude 3.5 Sonnet    $3.00                   $15.00
Gemini 1.5 Pro       $1.25                   $5.00
GPT-4o mini          $0.15                   $0.60
Claude 3 Haiku       $0.25                   $1.25

Self-hosted example: Qwen2.5-72B on A100

Assume:

  • Model: Qwen2.5-72B
  • Quantization: INT4
  • GPU: single A100 80GB
  • Cloud GPU price: about $1.99/hour
  • Throughput: about 200 tokens/second

At 200 tokens/second with full utilization:

200 tokens/sec * 3600 sec = 720,000 tokens/hour
$1.99 / 720,000 = ~$0.0028 per 1K tokens

That cost includes both input and output tokens.

For comparison, GPT-4o charges about $0.01 per 1K output tokens alone.

Break-even point

If you process more than roughly 70K output tokens per day consistently, self-hosting can beat GPT-4o on cost.

Below that, the API is usually cheaper because you are not paying for idle GPU time.
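
The exact break-even depends heavily on whether you pay for the GPU around the clock or only while it is serving requests. Here is a rough sketch in Python, reusing the illustrative figures above, so you can plug in your own volume and prices:

# Rough break-even check: self-hosted GPU vs. pay-per-token API.
# Uses the illustrative figures from above; adjust for your own prices,
# and note that an always-on GPU costs the full hourly rate even when idle.

GPU_PRICE_PER_HOUR = 1.99        # A100 80GB cloud rental
THROUGHPUT_TOK_PER_SEC = 200     # Qwen2.5-72B INT4
API_PRICE_PER_1K_OUTPUT = 0.01   # GPT-4o output pricing

def self_hosted_cost(tokens: int) -> float:
    """Cost if you only pay for GPU time actually spent generating."""
    hours = tokens / (THROUGHPUT_TOK_PER_SEC * 3600)
    return hours * GPU_PRICE_PER_HOUR

def api_cost(tokens: int) -> float:
    return tokens / 1000 * API_PRICE_PER_1K_OUTPUT

daily_tokens = 500_000
print(f"Self-hosted (pay-per-use GPU): ${self_hosted_cost(daily_tokens):.2f}/day")
print(f"Always-on GPU:                 ${GPU_PRICE_PER_HOUR * 24:.2f}/day")
print(f"GPT-4o API:                    ${api_cost(daily_tokens):.2f}/day")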

Smaller model example

A 4-bit quantized Gemma 4 12B model can run on a single RTX 4090.

Assume equivalent cloud GPU time costs about $0.40/hour.

In that case, self-hosting can break even against GPT-4o mini at roughly 15K output tokens/day.

Latency comparison

Latency depends on where the model runs and how much concurrency you need.

Time to first token

For a 72B model on a dedicated A100 with a 1K-token prompt:

TTFT: ~800ms to 1.5s

For OpenAI's API under normal load with similar inputs:

TTFT: ~300ms to 800ms

For on-device inference on iPhone Neural Engine or Apple Silicon:

TTFT: ~200ms to 400ms

On-device inference can win because there is no network round trip.

Throughput

A single A100 running a 72B INT4 model can serve one user well. Under concurrent load, performance degrades unless you use batching.

For production self-hosting, use a server designed for concurrency, such as vLLM.

Public APIs handle concurrency and burst traffic for you.
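
If you want a rough feel for how a self-hosted endpoint holds up under concurrent requests, a sketch like this works against any OpenAI-compatible server; it assumes vLLM on localhost:8000 and a model name from the vLLM example later in this article, so adjust both for your setup.

import asyncio
import time

from openai import AsyncOpenAI

# Send several identical requests concurrently to an OpenAI-compatible
# local server and report wall-clock time and token throughput.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen2.5-72B-Instruct-AWQ",
        messages=[{"role": "user", "content": "Summarize HTTP/2 in one sentence."}],
        max_tokens=64,
    )
    return resp.usage.completion_tokens

async def main(concurrency: int = 8) -> None:
    start = time.perf_counter()
    token_counts = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    total = sum(token_counts)
    print(f"{concurrency} requests, {total} output tokens, {elapsed:.1f}s ({total / elapsed:.0f} tok/s)")

asyncio.run(main())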

Streaming

Both local and API-based models can stream responses.

Local streaming avoids network jitter. API streaming depends on provider performance and network conditions.
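
Measuring time to first token yourself is straightforward with a streamed request. This sketch assumes Ollama on its default port; point base_url at vLLM or api.openai.com to compare environments.

import time

from openai import OpenAI

# Time to first streamed token against an OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gemma4:12b",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.3f}s")
        break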

Latency summary

Requirement                                       Best fit
Lowest possible latency on one device             On-device
High throughput with controlled infrastructure    Self-hosted with batching
Burst capacity without infrastructure work        Public API

Capability comparison

Public APIs still lead for the most demanding workloads.

Reasoning and complex tasks

GPT-4o and Claude 3.5 Sonnet remain ahead of open-weight models on benchmarks such as:

  • MMLU
  • HumanEval
  • Complex multi-step reasoning tasks

The gap has narrowed with models like Qwen2.5-72B and DeepSeek-V3, but it still exists.

Code generation

This is closer.

Models like DeepSeek-Coder-V2 and Qwen2.5-Coder-32B match GPT-4o on many code benchmarks. For code-specific workloads, a specialized local code model can be a better choice than a general-purpose model.

Context length

Frontier API models support very large context windows, often in the 128K to 1M token range.

Most self-hosted models are practical around 32K to 128K tokens. Longer contexts require proportionally more memory.
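
Most of that memory growth comes from the KV cache, which scales linearly with context length. A back-of-the-envelope estimate, using illustrative architecture numbers for a 72B-class model with grouped-query attention (check your model's config for the real values):

# KV-cache memory grows linearly with context length.
# Illustrative values for a 72B-class model with grouped-query attention.
layers = 80
kv_heads = 8
head_dim = 128
bytes_per_value = 2  # fp16 cache

def kv_cache_gb(context_tokens: int) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # keys + values
    return per_token * context_tokens / 1024**3

for ctx in (32_768, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.0f} GB of KV cache")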

Multimodal support

API models such as GPT-4o and Gemini 1.5 Pro support image, audio, and video inputs.

Open-weight multimodal models exist, including LLaVA and Qwen-VL, but they generally lag behind frontier API models.

Function calling and tool use

OpenAI and Anthropic currently provide the most reliable tool-use behavior.

Open-weight models can support tool use, but complex tool chains are less consistent. See [internal: how-ai-agent-memory-works] for how this affects agent architectures.
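
As a quick sanity check of tool-use behavior, here is a minimal sketch that sends an OpenAI-style tool definition to a local Ollama model; the get_weather tool is hypothetical, and whether the model returns a well-formed tool_calls entry is exactly what you should verify before building on it.

from openai import OpenAI

# Minimal OpenAI-style tool-use request against a local model.
# get_weather is a hypothetical tool used only to probe the response shape.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen2.5:72b",
    messages=[{"role": "user", "content": "What's the weather in Bangkok?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
else:
    print("No tool call; the model answered in plain text:", message.content)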

Privacy and data control

Local AI wins clearly when data control matters.

With a public API

Your application sends prompts to a third-party provider.

That means:

  • Prompts leave your network
  • The provider's data retention policy applies
  • OpenAI retains API inputs for up to 30 days by default for abuse monitoring, unless your organization qualifies for zero data retention
  • Sensitive content is subject to the provider's terms of service
  • Regulated workloads may require additional legal and compliance review

For healthcare, finance, legal, or proprietary-code workloads, this may be a blocker.

With a self-hosted model

Prompts stay inside your infrastructure.

You control:

  • Data retention
  • Network boundaries
  • Logging
  • Access policies
  • Which content the model can process

For applications handling personal health data, legal documents, or proprietary source code, self-hosting may be required.

How to test AI integrations regardless of where the model runs

Many local model servers expose an OpenAI-compatible API.

Compare OpenAI's endpoint with common local equivalents:

https://api.openai.com/v1/chat/completions      (OpenAI)
http://localhost:11434/api/chat                 (Ollama native API)
http://localhost:11434/v1/chat/completions      (Ollama, OpenAI-compatible)
http://localhost:8080/v1/chat/completions       (llama-server)

That compatibility matters because the same HTTP tests can run against local and cloud environments.

Here is a simplified Apidog Test Scenario structure:

{
  "scenario": "Chat completion smoke test",
  "environments": {
    "local": {
      "base_url": "http://localhost:11434"
    },
    "production": {
      "base_url": "https://api.openai.com"
    }
  },
  "steps": [
    {
      "name": "Basic completion",
      "method": "POST",
      "url": "{{base_url}}/v1/chat/completions",
      "body": {
        "model": "{{model_name}}",
        "messages": [
          {
            "role": "user",
            "content": "Say 'test passed' and nothing else"
          }
        ],
        "max_tokens": 20
      },
      "assertions": [
        {
          "field": "status",
          "operator": "equals",
          "value": 200
        },
        {
          "field": "response.choices[0].message.content",
          "operator": "contains",
          "value": "test passed"
        },
        {
          "field": "response.usage.total_tokens",
          "operator": "less_than",
          "value": 50
        }
      ]
    }
  ]
}

Run the scenario against Ollama during development and against OpenAI in CI.
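
If you would rather drive the same check from code, a rough pytest equivalent looks like this; it assumes Ollama is running on the default port and that OPENAI_API_KEY is set for the cloud case.

import os

import pytest
import requests

# Same smoke test, parametrized over a local Ollama server and OpenAI.
TARGETS = [
    ("http://localhost:11434/v1", "gemma4:12b", "ollama"),
    ("https://api.openai.com/v1", "gpt-4o-mini", os.environ.get("OPENAI_API_KEY", "")),
]

@pytest.mark.parametrize("base_url,model,api_key", TARGETS)
def test_chat_completion_smoke(base_url, model, api_key):
    resp = requests.post(
        f"{base_url}/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": "Say 'test passed' and nothing else"}],
            "max_tokens": 20,
        },
        timeout=60,
    )
    assert resp.status_code == 200
    body = resp.json()
    assert "test passed" in body["choices"][0]["message"]["content"].lower()
    assert body["usage"]["total_tokens"] < 50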

If the same client code does not work in both places, check these differences first:

  • Model name format
    • Ollama: qwen2.5:72b
    • OpenAI: gpt-4o
  • Function calling response structure
  • Streaming event format
  • Token usage fields
  • Error response shape

Apidog Smart Mock can also simulate local-model behavior in CI without keeping a GPU online. Configure a mock that returns valid OpenAI-compatible responses, then run your Test Scenarios against that mock.

See [internal: how-to-build-tiny-llm-from-scratch] for background on why response structures differ at the model level.

Setting up a local model server in 10 minutes

Ollama is the fastest way to test local inference.

Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

Pull a model

Example with Gemma 4 12B:

ollama pull gemma4:12b

Start the server

ollama serve

Ollama exposes an API on port 11434.

Test the local endpoint

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:12b",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ]
  }'

Production self-hosting with vLLM

For multi-user concurrency, vLLM is a better production option.

Install it:

pip install vllm

Start an OpenAI-compatible server:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 32768

This exposes an OpenAI-compatible API on port 8000.

You can then point your test client or Apidog environment at:

http://your-server:8000

When to choose local AI vs. API AI

Scenario                                               Local               API
High-volume batch processing (over 100K tokens/day)   Cheaper             Expensive
Privacy-sensitive data (health, legal, finance)        Required            Risky
Lowest latency on-device                               Best                Not possible
Frontier model capability needed                       Insufficient        Required
Burst workloads with variable traffic                  Complex to scale    Handles automatically
No GPU available                                       Hard                Easy
Dev/test environment                                   Great with Ollama   Costs money
Multimodal tasks                                       Limited             Full support
Regulated industry compliance                          Easier              Requires DPA

For many teams, the practical architecture is hybrid (a rough routing sketch follows this list):

  • Use a public API in production for quality-sensitive workloads
  • Use cheaper API models for high-volume simple tasks
  • Use Ollama locally for development and testing
  • Move to self-hosting when your monthly API bill justifies the GPU cost
  • Keep the API surface OpenAI-compatible so switching providers is easier
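
One way to wire up that hybrid setup, sketched with the openai Python client; the task names, models, and routing rules here are illustrative assumptions, not a prescribed layout:

import os

from openai import OpenAI

# Route requests to a local server or a public API per workload,
# keeping the client code identical everywhere.
PROVIDERS = {
    "local": {"base_url": "http://localhost:11434/v1", "api_key": "ollama", "model": "gemma4:12b"},
    "cheap_api": {"base_url": "https://api.openai.com/v1", "api_key": os.environ.get("OPENAI_API_KEY", ""), "model": "gpt-4o-mini"},
    "frontier": {"base_url": "https://api.openai.com/v1", "api_key": os.environ.get("OPENAI_API_KEY", ""), "model": "gpt-4o"},
}

ROUTES = {"dev": "local", "classification": "cheap_api", "complex_reasoning": "frontier"}

def complete(task: str, prompt: str) -> str:
    cfg = PROVIDERS[ROUTES[task]]
    client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(complete("classification", "Label this ticket as bug or feature: 'App crashes on login'"))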

See [internal: open-source-coding-assistants-2026] for how open source coding assistants fit into the local AI workflow.

Conclusion

The local vs. API decision is not binary.

Choose based on:

  • Token volume
  • Privacy requirements
  • Latency requirements
  • Model capability needs
  • Operational capacity
  • Compliance constraints

A practical default for most developers:

  1. Start with a public API.
  2. Use Ollama locally from day one.
  3. Keep your code provider-agnostic with OpenAI-compatible clients.
  4. Move high-volume or sensitive workloads to self-hosting when the cost or privacy case is clear.
  5. Test both environments consistently to catch behavior differences before production.

FAQ

What's the minimum GPU to run a useful local model?

An RTX 3060 with 12GB VRAM can run Qwen2.5-7B or Gemma 4 4B at full quality.

An RTX 4090 with 24GB VRAM can handle many 14B to 20B models at INT4 quantization and some 34B models at INT2.

For 72B models, you usually need either two 24GB GPUs or a single A100/H100-class GPU.

Can I run local AI on Apple Silicon?

Yes. Ollama has native Apple Silicon support and uses Apple hardware acceleration.

An M3 Pro with 18GB unified memory can run Qwen2.5-14B comfortably. An M4 Max with 128GB unified memory can handle 70B models.

Is local model output quality good enough for production?

It depends on the task.

Local models can work well for:

  • Code generation
  • Summarization
  • Structured data extraction
  • Classification
  • Internal automation

For complex reasoning, nuanced writing, or tasks requiring strong world knowledge, frontier API models still have a clear edge.

Do local models support function calling?

Yes, but reliability varies.

Models such as Llama 3.1, Qwen2.5, and Mistral support tool use. However, they are generally less reliable than GPT-4o or Claude 3.5 Sonnet on complex tool chains.

Test thoroughly before relying on local model tool use in production. See [internal: claude-code] for how frontier models handle tool use in coding contexts.

How much does it cost to self-host a 70B model on AWS?

A p4d.24xlarge instance with 8x A100 40GB GPUs costs about $32.77/hour on demand. It can run a 70B INT8 model with high throughput.

A g5.2xlarge instance with 1x A10G 24GB costs about $1.21/hour and can run a 14B INT4 model for lighter workloads.

Reserved instances can reduce these costs by roughly 30-40%.
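
For rough monthly budgeting, assuming about 730 hours per month and an illustrative 35% reserved-instance discount (the midpoint of that range):

# Monthly cost sketch for the instance types above.
HOURS_PER_MONTH = 730
RESERVED_DISCOUNT = 0.35  # rough midpoint of the 30-40% range

for name, hourly in (("p4d.24xlarge", 32.77), ("g5.2xlarge", 1.21)):
    on_demand = hourly * HOURS_PER_MONTH
    reserved = on_demand * (1 - RESERVED_DISCOUNT)
    print(f"{name}: ${on_demand:,.0f}/month on demand, ~${reserved:,.0f}/month reserved")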

What's the difference between Ollama and llama.cpp?

llama.cpp is the underlying inference engine.

Ollama wraps it with:

  • A REST API
  • Model management
  • pull, list, and rm commands for managing models
  • A simple CLI

Use Ollama for development. Use llama.cpp directly through llama-server if you need more control over quantization formats or hardware configuration.

Can I switch between local and API models without changing my code?

Yes, if you use an OpenAI-compatible client.

Example in Python:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

response = client.chat.completions.create(
    model="gemma4:12b",
    messages=[
        {"role": "user", "content": "Hello"}
    ]
)

print(response.choices[0].message.content)

To switch to OpenAI, change the environment configuration:

client = OpenAI(
    base_url="https://api.openai.com/v1",
    api_key=os.environ["OPENAI_API_KEY"]
)

Set base_url, api_key, and model through environment variables so your application code stays the same.
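
A minimal sketch of that pattern, using illustrative variable names (LLM_BASE_URL, LLM_API_KEY, LLM_MODEL are not standard, so pick whatever fits your deployment):

import os

from openai import OpenAI

# Provider details come from the environment, so the same code runs against
# Ollama in development and OpenAI in production.
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
    api_key=os.environ.get("LLM_API_KEY", "ollama"),
)

response = client.chat.completions.create(
    model=os.environ.get("LLM_MODEL", "gemma4:12b"),
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)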
