Hassann

Posted on May 26 • Originally published at apidog.com

How to Use Local LLMs as APIs

Your laptop can expose a local LLM behind the same OpenAI-style API your production code already uses. In practice, you swap one base_url, keep the same SDK calls, and test the same request/response contract against Ollama, vLLM, or llama.cpp. This gives you offline development, zero per-token local test cost, and a private path for sensitive prompts. This guide shows how to choose a runtime, start an OpenAI-compatible endpoint, point your client at it, and validate the flow with Apidog.


TL;DR

Run a local LLM API with Ollama, vLLM, or llama.cpp. Each can expose an OpenAI-compatible REST endpoint.

For example, if your current client points to:

https://api.openai.com/v1

you can switch local development to:

http://localhost:11434/v1

Then the same OpenAI SDK code can call a local model such as Llama 3.3, DeepSeek V4, or Qwen 3.6. Use Apidog environments to keep your API scenarios identical across local and hosted targets.

Introduction

Local LLM APIs are now practical for day-to-day development because the API surface has standardized. Most major runtimes now implement the OpenAI /v1/chat/completions shape, so you no longer need separate client code for local and hosted models.

That matters for API developers. If your existing Apidog request points at:

https://api.openai.com/v1/chat/completions

you can parameterize the base URL, switch environments, and send the same request to a model running on your own hardware. No new schema. No new client flow. No rewrite.

If you already track API spend per feature, you can run the same test cases against hosted and local models and make the trade-off explicit: local inference lowers cost and improves privacy, usually at the price of higher latency.

This walkthrough covers:

Choosing a local runtime
Starting an OpenAI-compatible server
Calling it from Python and JavaScript
Testing the same flow in Apidog
Understanding quantization and GPU offload
Comparing local vs hosted cost and latency

For a broader model overview, see Best local LLMs 2026.

Why local LLMs make sense for API developers

A local LLM API is useful when you need your development environment to behave like production without depending on a remote network call.

Common reasons include:

You need to debug while offline.
Customer networks block egress to hosted AI APIs.
Prompts contain sensitive user data.
You want repeatable model behavior for regression tests.
You want to reduce token spend during development.

Privacy is often the strongest reason. HIPAA, GDPR, and the EU AI Act can treat prompts as user data when they include patient notes, contracts, account details, biometric identifiers, or other sensitive content. Sending that data to a hosted endpoint may create a data-processor relationship you need to document and audit. Running inference on your own hardware can reduce that operational burden.

Cost also compounds quickly. If a team sends tens of millions of prompt tokens per day to a hosted model, development and test traffic can become expensive. Local inference moves that cost to hardware and electricity. You can run the same arithmetic against your own hosted usage; this GPT-5.5 Instant guide provides a related pricing breakdown.

The third reason is stability. Hosted model snapshots can be updated or retired. A local model file stays fixed until you replace it. That helps when your regression suite depends on consistent LLM behavior.

Three runtimes that expose OpenAI-compatible endpoints

Pick the runtime based on your workload and hardware.

Ollama

Ollama is the fastest path for local development. It provides a single CLI, handles model downloads, and runs an HTTP server on port 11434.

Install and run a model:

# install on macOS
brew install ollama

# start the server
ollama serve &

# pull a model
ollama pull llama3.3:70b-instruct-q4_K_M

# run it interactively
ollama run llama3.3:70b-instruct-q4_K_M

The OpenAI-compatible base URL is:

http://localhost:11434/v1

Use Ollama when you want:

Single-machine development
Simple setup
Local demos
CI smoke tests
Apple Silicon support

vLLM

vLLM is designed for higher-throughput serving. It uses PagedAttention and continuous batching to improve performance under concurrent load.

Start an OpenAI-compatible server:

pip install vllm

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192

The base URL is:

http://localhost:8000/v1

Use vLLM when you want:

Shared dev clusters
CUDA or ROCm GPU serving
Concurrent requests
Higher throughput than laptop-oriented runtimes

vLLM targets CUDA and ROCm GPU servers, so it is not the right choice for most Apple Silicon laptop workflows.

llama.cpp

llama.cpp is the low-level C++ runtime behind much of the GGUF ecosystem. It runs across a wide range of hardware and exposes an OpenAI-compatible endpoint through llama-server.

Build and run:

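git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# build with CMake (recent llama.cpp versions no longer build with make;
# Metal is enabled by default on macOS)
cmake -B build
cmake --build build --config Release -j

# --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 to keep it local
./build/bin/llama-server -m models/llama-3.3-70b-q4_k_m.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  -c 8192 \
  -ngl 99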

The endpoint is:

http://localhost:8080/v1/chat/completions

Use llama.cpp when you need:

Fine-grained quantization control
Memory mapping options
GPU layer offload tuning
Support for constrained or unusual hardware

LM Studio and Jan wrap llama.cpp in a GUI and can also expose OpenAI-compatible endpoints. They are useful when non-terminal users need to test prompts locally.

Verify the local endpoint

Before wiring your app, make a minimal SDK call.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # the SDK requires a value; Ollama ignores it
)

resp = client.chat.completions.create(
    model="llama3.3:70b-instruct-q4_K_M",
    messages=[
        {"role": "user", "content": "Reply with the word OK only."}
    ],
)

print(resp.choices[0].message.content)

Expected output:

OK

If that works, your runtime, port, model name, and SDK contract are aligned.
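If the call fails with a model-not-found error, it can help to ask the server which model names it actually advertises. The sketch below uses only the standard library and assumes the runtime serves the OpenAI-style model listing at /models under the base URL; the helper names are illustrative and not part of any SDK, and the sample payload is hand-written rather than captured from a server.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style {"data": [{"id": ...}]} listing."""
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def list_local_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Fetch the listing from a running server (performs a real HTTP request)."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))


# Offline demonstration with a hand-written sample payload.
sample = {
    "object": "list",
    "data": [{"id": "llama3.3:70b-instruct-q4_K_M", "object": "model"}],
}
print(models_url("http://localhost:11434/v1/"))
print(parse_model_ids(sample))
```

With a server running, list_local_models("http://localhost:11434/v1") should return the names you can pass as model.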

Test your local LLM with Apidog

A local LLM API is most useful when your tests can hit it the same way they hit production. In Apidog, use environments to switch only the base URL and API key.

Step 1: Create a local environment

Create an environment named Local.

Add:

BASE_URL=http://localhost:11434/v1
API_KEY=ollama

Step 2: Create a production environment

Clone your existing OpenAI environment and name it Production.

Use:

BASE_URL=https://api.openai.com/v1
API_KEY=<your-hosted-api-key>

Step 3: Parameterize the request

Change the request URL from a hardcoded host to:

{{BASE_URL}}/chat/completions

Set the authorization header to:

Authorization: Bearer {{API_KEY}}

Example request body:

{
  "model": "llama3.3:70b-instruct-q4_K_M",
  "messages": [
    {
      "role": "system",
      "content": "You are a concise API assistant."
    },
    {
      "role": "user",
      "content": "Return a JSON object with status=ok."
    }
  ],
  "temperature": 0.2
}
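Outside Apidog, the same parameterized request can be reproduced in a few lines of Python, which is handy for checking what actually goes over the wire. This is a sketch using only the standard library; build_chat_request is a hypothetical helper, and the two arguments stand in for the BASE_URL and API_KEY environment values above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, body: dict) -> urllib.request.Request:
    """Assemble the POST request described by {{BASE_URL}}/chat/completions."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


body = {
    "model": "llama3.3:70b-instruct-q4_K_M",
    "messages": [
        {"role": "system", "content": "You are a concise API assistant."},
        {"role": "user", "content": "Return a JSON object with status=ok."},
    ],
    "temperature": 0.2,
}

# Values from the Local environment; swap in the Production pair to retarget.
req = build_chat_request("http://localhost:11434/v1", "ollama", body)
print(req.get_method(), req.full_url)
```

Nothing is sent until you pass req to urllib.request.urlopen, so the block runs without a server.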

Step 4: Add scenario assertions

Create a scenario test that sends the request and checks:

choices[0].message.role == "assistant"
choices[0].message.content is not empty
usage.total_tokens > 0

These assertions validate the response contract without depending on exact model wording.
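If you want the same three checks in a unit test, they can be sketched as a plain Python function over the parsed response body. The function name and the sample response are illustrative assumptions, not Apidog output.

```python
def contract_violations(response: dict) -> list[str]:
    """Return human-readable contract violations; an empty list means pass."""
    choices = response.get("choices") or []
    if not choices:
        return ["choices is missing or empty"]

    problems = []
    message = choices[0].get("message") or {}
    if message.get("role") != "assistant":
        problems.append("choices[0].message.role is not 'assistant'")
    if not (message.get("content") or "").strip():
        problems.append("choices[0].message.content is empty")

    usage = response.get("usage") or {}
    if not (usage.get("total_tokens") or 0) > 0:
        problems.append("usage.total_tokens is not > 0")
    return problems


# Hand-written sample response in the chat completions shape.
sample = {
    "choices": [{"message": {"role": "assistant", "content": "{\"status\": \"ok\"}"}}],
    "usage": {"total_tokens": 42},
}
print(contract_violations(sample))
```

Returning a list rather than raising on the first failure shows every broken field in one run, which is useful when a runtime upgrade changes several things at once.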

Step 5: Run the same scenario twice

Run once with the Local environment.

Then switch to Production and run again.

The same request and assertions should pass for both environments. This gives you a reusable smoke test for local runtime upgrades, hosted model changes, and client-side contract drift.

The same pattern also applies to testing AI agents that call multi-step APIs.

Wire the local model into application code

Python

Use one function to choose the target environment:

import os
from openai import OpenAI


def get_client():
    if os.getenv("ENV") == "local":
        return OpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama",
        )

    return OpenAI(
        api_key=os.environ["OPENAI_API_KEY"]
    )


client = get_client()

response = client.chat.completions.create(
    model=os.getenv("MODEL", "llama3.3:70b-instruct-q4_K_M"),
    messages=[
        {"role": "system", "content": "You are a JSON-only assistant."},
        {"role": "user", "content": "Return {\"status\": \"ok\"}."},
    ],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)

Run locally:

ENV=local MODEL=llama3.3:70b-instruct-q4_K_M python app.py

Run against hosted OpenAI:

ENV=production OPENAI_API_KEY=sk-... MODEL=gpt-... python app.py

JavaScript

import OpenAI from "openai";

const isLocal = process.env.ENV === "local";

const client = new OpenAI({
  baseURL: isLocal
    ? "http://localhost:11434/v1"
    : "https://api.openai.com/v1",
  apiKey: isLocal ? "ollama" : process.env.OPENAI_API_KEY,
});

const resp = await client.chat.completions.create({
  model: process.env.MODEL || "llama3.3:70b-instruct-q4_K_M",
  messages: [
    {
      role: "user",
      content: "Say hi.",
    },
  ],
});

console.log(resp.choices[0].message.content);

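Run locally (the snippet uses ES module syntax and top-level await, so set "type": "module" in package.json or use an .mjs file name):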

ENV=local MODEL=llama3.3:70b-instruct-q4_K_M node app.js

Add the scenario to CI

After you validate the request manually, export the Apidog project as an apidog-cli collection and run it in CI.

Example GitHub Actions shape:

name: API contract tests

on:
  pull_request:
  push:
    branches: [main]

jobs:
  test-api-contract:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Install Apidog CLI
        run: npm install -g apidog-cli

      - name: Run Apidog scenarios
        run: apidog run ./apidog-collection.json

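If an assertion fails, the command exits non-zero and the build fails. The runner must be able to reach the endpoint under test, so a job on a hosted runner needs either the Production environment or a runtime started earlier in the same job.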

QA teams can wire the same flow into existing API testing pipelines.

Advanced techniques and pro tips

Choose the right quantization

Quantization decides whether a large model fits on your machine.

GGUF models commonly ship in 8-bit, 6-bit, 5-bit, 4-bit, 3-bit, and 2-bit variants.

Practical defaults:

Quantization	Use case
`Q8_0`	Better quality, higher RAM and disk use
`Q5_K_M`	Good quality if you have extra memory
`Q4_K_M`	Strong default for chat workloads
`Q2_K`	Smaller footprint, larger quality loss

For most local chat testing, start with Q4_K_M. For code generation or stricter output quality, try Q5_K_M or Q8_0 if your hardware can handle it.
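To see whether a given variant will fit, a rough estimate of the weight size is parameters times bits per weight, divided by eight. The sketch below applies that arithmetic to a 70B model. The bits-per-weight figures are approximations assumed for illustration (real GGUF files vary by model and version), and the estimate excludes the KV cache and runtime buffers.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


# Assumed effective bits per weight; treat these as ballpark values only.
APPROX_BITS = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q2_K": 3.0}

for name, bits in APPROX_BITS.items():
    print(f"70B at {name}: ~{estimate_weight_gb(70, bits):.0f} GB")
```

Compare the result with your free RAM or VRAM and leave headroom for the context window before choosing a variant.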

Tune GPU offload

In llama.cpp, -ngl controls how many transformer layers are offloaded to GPU:

./llama-server -m model.gguf -ngl 99

Ollama chooses how many layers to offload automatically based on available memory; its num_gpu option can override that choice.

Set GPU offload as high as your VRAM allows. Layers that fall back to CPU reduce throughput.
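When the whole model does not fit in VRAM, a starting value for -ngl can be estimated from the model size and layer count. The sketch below is a rough heuristic, not how llama.cpp itself decides: it assumes all layers are the same size and reserves a guessed amount of memory for the KV cache and buffers. The function name and the example numbers are hypothetical.

```python
import math


def layers_that_fit(vram_gb: float, model_gb: float, n_layers: int,
                    reserve_gb: float = 2.0) -> int:
    """Rough count of layers that fit in VRAM, assuming equal-sized layers.

    reserve_gb is a guessed allowance for the KV cache and runtime buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Hypothetical example: a 42 GB model with 80 layers on a 24 GB GPU.
print(layers_that_fit(vram_gb=24, model_gb=42, n_layers=80))
```

Treat the result as a first guess: start there, watch memory use, and raise or lower -ngl until the server loads without running out of memory.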

Keep memory mapping enabled

llama.cpp and Ollama use memory mapping by default. This lets the OS page model weights in as needed instead of allocating the full model at startup.

Keep mmap enabled unless your container or deployment environment has strict memory behavior that requires otherwise.

Use batching with vLLM

Batching is where vLLM performs best. With concurrent requests, vLLM groups work into efficient GPU passes.

Example:

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --max-num-seqs 64

For larger GPUs, increase the sequence count based on available memory and workload.
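Batching only helps if the client actually sends requests concurrently. A minimal sketch of that client side, using a thread pool from the standard library, follows. The send callable is an assumption: here it is a stand-in that echoes the prompt so the block runs without a server, and in practice you would replace it with a function that calls the chat completions endpoint.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_concurrently(prompts: list[str], send: Callable[[str], str],
                     max_workers: int = 8) -> list[str]:
    """Send prompts from parallel threads; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt: str) -> str:
    """Stand-in for a real request, so the sketch runs without a server."""
    return f"echo: {prompt}"


prompts = [f"question {i}" for i in range(4)]
print(run_concurrently(prompts, fake_send))
```

Threads are enough here because each worker spends its time waiting on the network; the server does the batching.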

Stream responses

Streaming reduces perceived latency because the client receives tokens as they are generated.

Python example:

stream = client.chat.completions.create(
    model="llama3.3:70b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Explain local LLM APIs."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")

All runtimes discussed here support streaming through the OpenAI-compatible API shape.
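When you debug streaming without the SDK, it helps to know what arrives on the wire: a series of server-sent event lines, each starting with data: and carrying a JSON chunk, ending with a [DONE] sentinel. The sketch below parses that format, assuming the runtime follows the OpenAI event layout; the function name is illustrative and the sample lines are hand-written, not captured from a server.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # some servers send chunks without choices
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hand-written sample events.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample_lines)))
```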

Use an Ollama Modelfile

A Modelfile lets you package defaults such as system prompts, temperature, and stop sequences.

Example Modelfile:

FROM llama3.3:70b-instruct-q4_K_M

SYSTEM """
You are a concise API assistant.
Return implementation-focused answers.
"""

PARAMETER temperature 0.2
PARAMETER stop "</response>"

Create the model:

ollama create my-assistant -f Modelfile

Then call:

response = client.chat.completions.create(
    model="my-assistant",
    messages=[{"role": "user", "content": "Generate a curl example."}],
)

Common mistakes

Avoid these when moving between hosted and local LLM APIs:

Hardcoding http://localhost:11434 in application code. Use an environment variable.
Assuming all local runtimes enforce max_tokens the same way. Set explicit limits and stop sequences.
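Running multiple runtimes on the same port. Keep the defaults distinct (11434 for Ollama, 8000 for vLLM, 8080 for llama-server) or set the port explicitly.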
Omitting the Authorization header. Ollama may ignore it, but vLLM can reject requests when --api-key is enabled.
Expecting heavily quantized local models to match hosted frontier models on reasoning-heavy tasks.
Testing only the happy path. Add assertions for error responses and malformed outputs.
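For the last point, one unhappy path worth covering is a model that ignores a JSON-only instruction. The sketch below shows one way to turn that into a testable result instead of an exception deep in application code. The function name and the error strings are assumptions for illustration.

```python
import json


def parse_json_reply(content):
    """Return (value, None) for a JSON object reply, or (None, reason) otherwise."""
    try:
        value = json.loads(content)
    except (TypeError, json.JSONDecodeError) as exc:
        return None, f"invalid JSON: {exc}"
    if not isinstance(value, dict):
        return None, "expected a JSON object"
    return value, None


print(parse_json_reply('{"status": "ok"}'))
print(parse_json_reply("Sure! Here is the JSON you asked for."))
```

A scenario test can then assert that the reason is None for both the local and the hosted target, which catches smaller local models that drift into prose.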

Local vs hosted: cost and latency math

The table below compares local inference on an M3 Max with 128 GB unified memory against hosted equivalents. Time to first token is measured cold, with no batching, on a 1,024-token prompt.

Model	Local TTFT	Local throughput	Hosted equivalent	Hosted price	Hosted TTFT
Llama 3.3 70B Q4_K_M	1.2 s	12 tok/s	GPT-5.5 Instant	$5 / $30 per 1M	200 ms
DeepSeek V4 67B Q4_K_M	1.4 s	10 tok/s	DeepSeek-Chat hosted	$0.55 / $2.20 per 1M	280 ms
Qwen 3.6 32B Q5_K_M	0.7 s	28 tok/s	Qwen-Max hosted	$1.60 / $6.40 per 1M	240 ms
Gemma 4 27B Q4_K_M	0.5 s	35 tok/s	Gemini 3 Flash	$0.35 / $1.05 per 1M	180 ms

Hosted APIs usually win on latency. Local APIs win on privacy immediately and can win on cost once development or internal traffic becomes large enough.
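To put numbers on "large enough", the arithmetic can be sketched as follows. It assumes the two hosted prices in the table are input and output prices per 1M tokens, and it uses an invented daily volume of 20M input and 2M output tokens purely for illustration; the function names are hypothetical.

```python
def hosted_cost_usd(input_tokens: int, output_tokens: int,
                    input_price_per_m: float, output_price_per_m: float) -> float:
    """Hosted cost for a token volume, given prices per 1M tokens."""
    return ((input_tokens / 1_000_000) * input_price_per_m
            + (output_tokens / 1_000_000) * output_price_per_m)


def local_generation_hours(output_tokens: int, tokens_per_second: float) -> float:
    """Time for one local machine to generate the tokens, with no batching."""
    return output_tokens / tokens_per_second / 3600


# Assumed daily test traffic (illustrative, not measured).
daily = hosted_cost_usd(20_000_000, 2_000_000,
                        input_price_per_m=5.0, output_price_per_m=30.0)
hours = local_generation_hours(2_000_000, tokens_per_second=12)
print(f"hosted: ${daily:.2f} per day")
print(f"local: {hours:.1f} hours of generation at 12 tok/s")
```

The second function is the reality check: at laptop throughput, a single machine cannot absorb a large shared workload in a day, so the cost advantage applies to per-developer traffic or to a dedicated GPU server.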

A practical deployment pattern:

Use local models during the inner development loop.
Use hosted models in staging and production when latency matters.
Keep both targets covered by the same Apidog scenario tests.
Switch with environment variables, not code branches.
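One way to apply the last point is to read the whole target from environment variables, so the application has no local-versus-hosted branch at all. This is a sketch: the variable names LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL and the LLMSettings class are assumptions, not names defined by any SDK.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    api_key: str
    model: str


def load_llm_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Read the target from environment variables, defaulting to local Ollama."""
    return LLMSettings(
        base_url=env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        api_key=env.get("LLM_API_KEY", "ollama"),
        model=env.get("LLM_MODEL", "llama3.3:70b-instruct-q4_K_M"),
    )


# Plain dicts stand in for the process environment in this demonstration.
local = load_llm_settings({})
hosted = load_llm_settings({
    "LLM_BASE_URL": "https://api.openai.com/v1",
    "LLM_API_KEY": "sk-...",
    "LLM_MODEL": "gpt-...",
})
print(local.base_url, "|", hosted.base_url)
```

The settings can be passed straight to the SDK, for example OpenAI(base_url=settings.base_url, api_key=settings.api_key), and accepting a mapping makes the loader easy to unit test.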

For model-specific walkthroughs, see How to run DeepSeek V4 locally and the DeepSeek V4 usage guide.

Real-world use cases

Compliance-heavy development

A fintech compliance team can use Ollama on engineer laptops to draft suspicious activity report prototypes without sending account numbers or transaction patterns to a hosted provider. Production can still use a hosted model with a redacted prompt.

Apidog scenarios can assert that the redaction step runs before any request leaves the local environment.
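A redaction step of that kind can be sketched in a few lines. The pattern below is a deliberately simple assumption (account numbers as runs of 8 to 17 digits); a real compliance pipeline would need patterns reviewed for its own data, and both function names are hypothetical.

```python
import re

# Assumption: account numbers appear as standalone runs of 8 to 17 digits.
ACCOUNT_NUMBER = re.compile(r"\b\d{8,17}\b")


def redact_account_numbers(prompt: str) -> str:
    """Replace anything that looks like an account number with a placeholder."""
    return ACCOUNT_NUMBER.sub("[REDACTED]", prompt)


def assert_redacted(prompt: str) -> None:
    """Raise if a prompt bound for a hosted endpoint still contains a match."""
    if ACCOUNT_NUMBER.search(prompt):
        raise ValueError("prompt still contains an unredacted account number")


raw = "Review transfers from account 12345678901 during 2024."
clean = redact_account_numbers(raw)
assert_redacted(clean)  # passes silently once redaction has run
print(clean)
```

Calling assert_redacted immediately before the hosted request turns a missed redaction into a failed test rather than a data leak.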

Prompt engineering training

A game studio can run a local Qwen model for internal prompt training. Interns can test workflows offline without exposing unreleased game lore to a third-party endpoint.

The same application can later use Gemini 3 Flash in production by changing only the environment. For production wiring, see the Gemini 3 Flash API guide.

Private network inference

A healthcare startup can run vLLM on a GPU server inside a hospital network. The endpoint stays off public DNS, while developers still use the OpenAI SDK and the same contract tests they use locally.

Conclusion

Local LLM APIs are now straightforward to integrate because they can mimic the OpenAI API shape. The implementation path is simple:

Pick Ollama for laptops, vLLM for shared GPU serving, or llama.cpp for tight hardware control.
Start the OpenAI-compatible endpoint.
Verify it with a minimal SDK request.
Move base_url and api_key into environment variables.
Build Apidog scenarios that run against both local and hosted environments.

Use Apidog to keep those contracts testable as you switch models and runtimes. If you have not picked a model yet, start with Best local LLMs 2026. For agent workflows, read How to test AI agents API.
