Your laptop can expose a local LLM behind the same OpenAI-style API your production code already uses. In practice, you swap one base_url, keep the same SDK calls, and test the same request/response contract against Ollama, vLLM, or llama.cpp. This gives you offline development, zero per-token local test cost, and a private path for sensitive prompts. This guide shows how to choose a runtime, start an OpenAI-compatible endpoint, point your client at it, and validate the flow with Apidog.
TL;DR
Run a local LLM API with Ollama, vLLM, or llama.cpp. Each can expose an OpenAI-compatible REST endpoint.
For example, if your current client points to:
https://api.openai.com/v1
you can switch local development to:
http://localhost:11434/v1
Then the same OpenAI SDK code can call a local model such as Llama 3.3, DeepSeek V4, or Qwen 3.6. Use Apidog environments to keep your API scenarios identical across local and hosted targets.
Introduction
Local LLM APIs are now practical for day-to-day development because the API surface has standardized. Most major runtimes now implement the OpenAI /v1/chat/completions shape, so you no longer need separate client code for local and hosted models.
That matters for API developers. If your existing Apidog request points at:
https://api.openai.com/v1/chat/completions
you can parameterize the base URL, switch environments, and send the same request to a model running on your own hardware. No new schema. No new client flow. No rewrite.
If you already track API spend per feature, you can compare hosted and local models with the same test cases and make the trade-off explicit: local inference lowers cost and improves privacy, usually at the price of higher latency than hosted APIs.
This walkthrough covers:
- Choosing a local runtime
- Starting an OpenAI-compatible server
- Calling it from Python and JavaScript
- Testing the same flow in Apidog
- Understanding quantization and GPU offload
- Comparing local vs hosted cost and latency
For a broader model overview, see Best local LLMs 2026.
Why local LLMs make sense for API developers
A local LLM API is useful when you need your development environment to behave like production without depending on a remote network call.
Common reasons include:
- You need to debug while offline.
- Customer networks block egress to hosted AI APIs.
- Prompts contain sensitive user data.
- You want repeatable model behavior for regression tests.
- You want to reduce token spend during development.
Privacy is often the strongest reason. HIPAA, GDPR, and the EU AI Act can treat prompts as user data when they include patient notes, contracts, account details, biometric identifiers, or other sensitive content. Sending that data to a hosted endpoint may create a data-processor relationship you need to document and audit. Running inference on your own hardware can reduce that operational burden.
Cost also compounds quickly. If a team sends tens of millions of prompt tokens per day to a hosted model, development and test traffic can become expensive. Local inference moves that cost to hardware and electricity. You can compare the same arithmetic with your hosted usage; this GPT-5.5 Instant guide provides a related pricing breakdown.
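To make that arithmetic concrete, here is a minimal sketch of a daily cost estimate. The function name and the token volumes are illustrative assumptions; the $5 / $30 per 1M token prices are the GPT-5.5 Instant figures from the comparison table later in this guide.

```python
def daily_hosted_cost(prompt_tokens, completion_tokens,
                      input_price_per_m, output_price_per_m):
    """Estimate one day's hosted spend in dollars.

    Prices are dollars per 1M tokens, billed separately for input and output.
    """
    input_cost = prompt_tokens / 1_000_000 * input_price_per_m
    output_cost = completion_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical workload: 10M prompt tokens and 1M completion tokens per day.
cost = daily_hosted_cost(10_000_000, 1_000_000, 5.0, 30.0)
print(f"${cost:.2f} per day")  # $80.00 per day
```

Multiply the daily figure by your working days per month and set it against the hardware and electricity cost of a local machine.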
The third reason is stability. Hosted model snapshots can be updated or retired. A local model file stays fixed until you replace it. That helps when your regression suite depends on consistent LLM behavior.
Three runtimes that expose OpenAI-compatible endpoints
Pick the runtime based on your workload and hardware.
Ollama
Ollama is the fastest path for local development. It provides a single CLI, handles model downloads, and runs an HTTP server on port 11434.
Install and run a model:
# install on macOS
brew install ollama
# start the server
ollama serve &
# pull a model
ollama pull llama3.3:70b-instruct-q4_K_M
# run it interactively
ollama run llama3.3:70b-instruct-q4_K_M
The OpenAI-compatible base URL is:
http://localhost:11434/v1
Use Ollama when you want:
- Single-machine development
- Simple setup
- Local demos
- CI smoke tests
- Apple Silicon support
vLLM
vLLM is designed for higher-throughput serving. It uses PagedAttention and continuous batching to improve performance under concurrent load.
Start an OpenAI-compatible server:
pip install vllm
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--port 8000 \
--gpu-memory-utilization 0.9 \
--max-model-len 8192
The base URL is:
http://localhost:8000/v1
Use vLLM when you want:
- Shared dev clusters
- CUDA or ROCm GPU serving
- Concurrent requests
- Higher throughput than laptop-oriented runtimes
vLLM targets CUDA and ROCm GPUs, so it is not the right choice for most Apple Silicon laptop workflows.
llama.cpp
llama.cpp is the low-level C++ runtime behind much of the GGUF ecosystem. It runs across a wide range of hardware and exposes an OpenAI-compatible endpoint through llama-server.
Build and run:
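# clone and build with CMake (Metal is enabled by default on macOS)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# serve a GGUF model; --host 0.0.0.0 exposes the port to your network
./build/bin/llama-server -m models/llama-3.3-70b-q4_k_m.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  -c 8192 \
  -ngl 99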
The endpoint is:
http://localhost:8080/v1/chat/completions
Use llama.cpp when you need:
- Fine-grained quantization control
- Memory mapping options
- GPU layer offload tuning
- Support for constrained or unusual hardware
LM Studio and Jan wrap llama.cpp in a GUI and can also expose OpenAI-compatible endpoints. They are useful when non-terminal users need to test prompts locally.
Verify the local endpoint
Before wiring your app, make a minimal SDK call.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # placeholder; the SDK requires a non-empty key
)

resp = client.chat.completions.create(
    model="llama3.3:70b-instruct-q4_K_M",  # must match a model you have pulled
    messages=[
        {"role": "user", "content": "Reply with the word OK only."}
    ],
)
print(resp.choices[0].message.content)
Expected output:
OK
If that works, your runtime, port, model name, and SDK contract are aligned.
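If it does not work, it helps to check the pieces separately. The sketch below uses only the standard library to list the model IDs a server reports, assuming the runtime exposes the OpenAI-style GET /v1/models route (recent versions of Ollama, vLLM, and llama-server do, as far as I know). The helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def extract_model_ids(payload):
    """Pull model IDs out of an OpenAI-style /models response body."""
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def list_local_models(base_url, timeout=3.0):
    """Return model IDs served at base_url, or None if the server is unreachable."""
    url = base_url.rstrip("/") + "/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return extract_model_ids(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return None


if __name__ == "__main__":
    print(list_local_models("http://localhost:11434/v1"))
```

A result of None points at the runtime or port. A list that lacks your model name points at the model argument in the SDK call.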
Test your local LLM with Apidog
A local LLM API is most useful when your tests can hit it the same way they hit production. In Apidog, use environments to switch only the base URL and API key.
Step 1: Create a local environment
Create an environment named Local.
Add:
BASE_URL=http://localhost:11434/v1
API_KEY=ollama
Step 2: Create a production environment
Clone your existing OpenAI environment and name it Production.
Use:
BASE_URL=https://api.openai.com/v1
API_KEY=<your-hosted-api-key>
Step 3: Parameterize the request
Change the request URL from a hardcoded host to:
{{BASE_URL}}/chat/completions
Set the authorization header to:
Authorization: Bearer {{API_KEY}}
Example request body:
{
  "model": "llama3.3:70b-instruct-q4_K_M",
  "messages": [
    {
      "role": "system",
      "content": "You are a concise API assistant."
    },
    {
      "role": "user",
      "content": "Return a JSON object with status=ok."
    }
  ],
  "temperature": 0.2
}
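Conceptually, an environment is a set of key/value pairs substituted into {{NAME}} placeholders before the request is sent. The Python sketch below imitates that substitution with the values from Steps 1 and 2 to show why only the environment changes between targets. It is an illustration, not Apidog's implementation, and render is a hypothetical helper.

```python
import re

ENVIRONMENTS = {
    "Local": {"BASE_URL": "http://localhost:11434/v1", "API_KEY": "ollama"},
    "Production": {"BASE_URL": "https://api.openai.com/v1", "API_KEY": "<your-hosted-api-key>"},
}


def render(template, variables):
    """Replace each {{NAME}} placeholder with its value from variables."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined variable: {name}")
        return variables[name]
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


for name, variables in ENVIRONMENTS.items():
    url = render("{{BASE_URL}}/chat/completions", variables)
    print(f"{name}: {url}")
```

Raising on an undefined variable is a deliberate choice in this sketch: a silent empty substitution would send the request to a malformed URL.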
Step 4: Add scenario assertions
Create a scenario test that sends the request and checks:
choices[0].message.role == "assistant"
choices[0].message.content is not empty
usage.total_tokens > 0
These assertions validate the response contract without depending on exact model wording.
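If you also want these checks in a unit test, the same three assertions can be written as a plain Python function over the parsed response body. This is a minimal sketch; contract_failures and the sample payload are hypothetical, not part of Apidog or the OpenAI SDK.

```python
def contract_failures(response):
    """Return a list of contract violations for a chat completion response dict."""
    failures = []
    choices = response.get("choices") or []
    if not choices:
        return ["choices is missing or empty"]
    message = choices[0].get("message") or {}
    if message.get("role") != "assistant":
        failures.append("choices[0].message.role is not 'assistant'")
    if not (message.get("content") or "").strip():
        failures.append("choices[0].message.content is empty")
    usage = response.get("usage") or {}
    if not usage.get("total_tokens", 0) > 0:
        failures.append("usage.total_tokens is not positive")
    return failures


sample = {
    "choices": [{"message": {"role": "assistant", "content": "{\"status\": \"ok\"}"}}],
    "usage": {"total_tokens": 42},
}
print(contract_failures(sample))  # []
```

Returning every failure instead of stopping at the first makes a broken runtime upgrade easier to diagnose in one run.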
Step 5: Run the same scenario twice
Run once with the Local environment.
Then switch to Production and run again.
The same request and assertions should pass for both environments. This gives you a reusable smoke test for local runtime upgrades, hosted model changes, and client-side contract drift.
The same pattern also applies to testing AI agents that call multi-step APIs.
Wire the local model into application code
Python
Use one function to choose the target environment:
import os
from openai import OpenAI

def get_client():
    # ENV=local targets the Ollama endpoint; any other value targets hosted OpenAI.
    if os.getenv("ENV") == "local":
        return OpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama",  # placeholder; the SDK requires a non-empty key
        )
    return OpenAI(
        api_key=os.environ["OPENAI_API_KEY"]
    )

client = get_client()
response = client.chat.completions.create(
    model=os.getenv("MODEL", "llama3.3:70b-instruct-q4_K_M"),
    messages=[
        {"role": "system", "content": "You are a JSON-only assistant."},
        {"role": "user", "content": "Return {\"status\": \"ok\"}."},
    ],
    # Ask for a JSON object instead of free-form text.
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)
Run locally:
ENV=local MODEL=llama3.3:70b-instruct-q4_K_M python app.py
Run against hosted OpenAI:
ENV=production OPENAI_API_KEY=sk-... MODEL=gpt-... python app.py
JavaScript
import OpenAI from "openai";

const isLocal = process.env.ENV === "local";

const client = new OpenAI({
  baseURL: isLocal
    ? "http://localhost:11434/v1"
    : "https://api.openai.com/v1",
  apiKey: isLocal ? "ollama" : process.env.OPENAI_API_KEY,
});

const resp = await client.chat.completions.create({
  model: process.env.MODEL || "llama3.3:70b-instruct-q4_K_M",
  messages: [
    {
      role: "user",
      content: "Say hi.",
    },
  ],
});

console.log(resp.choices[0].message.content);
Run locally:
ENV=local MODEL=llama3.3:70b-instruct-q4_K_M node app.js
Add the scenario to CI
After you validate the request manually, export the Apidog project as an apidog-cli collection and run it in CI.
Example GitHub Actions shape:
name: API contract tests
on:
  pull_request:
  push:
    branches: [main]
jobs:
  test-api-contract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Apidog CLI
        run: npm install -g apidog-cli
      - name: Run Apidog scenarios
        run: apidog run ./apidog-collection.json
If an assertion fails, the command exits non-zero and the build fails.
QA teams can wire the same flow into existing API testing pipelines.
Advanced techniques and pro tips
Choose the right quantization
Quantization decides whether a large model fits on your machine.
GGUF models commonly ship in 8-bit, 6-bit, 5-bit, 4-bit, 3-bit, and 2-bit variants.
Practical defaults:
| Quantization | Use case |
|---|---|
| Q8 | Better quality, higher RAM and disk use |
| Q5_K_M | Good quality if you have extra memory |
| Q4_K_M | Strong default for chat workloads |
| Q2_K | Smaller footprint, larger quality loss |
For most local chat testing, start with Q4_K_M. For code generation or stricter output quality, try Q5_K_M or Q8 if your hardware can handle it.
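A quick way to check whether a quantization level is even in range for your machine is to multiply the parameter count by the bits per weight. The sketch below does only that. It uses nominal bit widths as an assumption: real GGUF files are somewhat larger because K-quant formats store scaling data and mix precisions, and the runtime needs additional memory for the KV cache.

```python
def approx_weight_gb(params_billion, bits_per_weight):
    """Rough size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Nominal estimates for a 70B-parameter model at several bit widths.
for bits in (8, 5, 4, 2):
    print(f"70B at {bits} bits/weight: about {approx_weight_gb(70, bits):.1f} GB")
```

Treat the result as a lower bound and leave headroom for context length and the rest of your system.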
Tune GPU offload
In llama.cpp, -ngl controls how many transformer layers are offloaded to GPU:
./llama-server -m model.gguf -ngl 99
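In Ollama, GPU offload is chosen automatically by default. To override it, set the num_gpu option (the number of layers to send to the GPU), for example with PARAMETER num_gpu in a Modelfile.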
Set GPU offload as high as your VRAM allows. Layers that fall back to CPU reduce throughput.
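When the whole model does not fit in VRAM, a rough starting value for the layer count can be estimated before tuning by hand. This sketch assumes all layers are the same size and reserves a guessed 2 GB for the KV cache and buffers; both are simplifications, and the function name is hypothetical.

```python
import math


def layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Estimate how many layers fit in VRAM, assuming equally sized layers."""
    usable = max(vram_gb - reserve_gb, 0.0)
    per_layer = model_gb / n_layers
    return min(n_layers, math.floor(usable / per_layer))


# Hypothetical: a 40 GB GGUF with 80 layers on a 24 GB GPU.
print(layers_that_fit(40, 80, 24))  # 44
```

Start from the estimate, then lower the value if the runtime reports out-of-memory errors or raise it if VRAM is left unused.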
Keep memory mapping enabled
llama.cpp and Ollama use memory mapping by default. This lets the OS page model weights in as needed instead of allocating the full model at startup.
Keep mmap enabled unless your container or deployment environment has strict memory behavior that requires otherwise.
Use batching with vLLM
Batching is where vLLM performs best. With concurrent requests, vLLM groups work into efficient GPU passes.
Example:
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--max-num-seqs 64
For larger GPUs, increase the sequence count based on available memory and workload.
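Batching only helps if the client actually sends requests at the same time. A minimal sketch of a concurrent caller follows. The send argument is any function that takes a prompt and returns a reply; in practice it would wrap client.chat.completions.create, and fake_send is a stand-in so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(send, prompts, max_workers=8):
    """Call send(prompt) for every prompt in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt):
    # Stand-in for a real request to the vLLM endpoint.
    return f"echo: {prompt}"


print(run_concurrently(fake_send, ["a", "b", "c"]))
```

Threads are sufficient here because each call spends its time waiting on the network, not computing.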
Stream responses
Streaming reduces perceived latency because the client receives tokens as they are generated.
Python example:
stream = client.chat.completions.create(
    model="llama3.3:70b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Explain local LLM APIs."}],
    stream=True,
)
for chunk in stream:
    # Some chunks (for example a final usage chunk) carry no choices.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
All runtimes discussed here support streaming through the OpenAI-compatible API shape.
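Under the SDK, an OpenAI-style stream is a sequence of server-sent event lines, each starting with data: and ending with a data: [DONE] marker. The sketch below parses such lines without any SDK, which can help when you inspect raw responses in a test tool. The sample lines are hand-written for illustration, not captured from a runtime.

```python
import json


def iter_content_deltas(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_content_deltas(sample)))  # Hello
```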
Use an Ollama Modelfile
A Modelfile lets you package defaults such as system prompts, temperature, and stop sequences.
Example Modelfile:
FROM llama3.3:70b-instruct-q4_K_M
SYSTEM """
You are a concise API assistant.
Return implementation-focused answers.
"""
PARAMETER temperature 0.2
PARAMETER stop "</response>"
Create the model:
ollama create my-assistant -f Modelfile
Then call:
response = client.chat.completions.create(
model="my-assistant",
messages=[{"role": "user", "content": "Generate a curl example."}],
)
Common mistakes
Avoid these when moving between hosted and local LLM APIs:
- Hardcoding http://localhost:11434 in application code. Use an environment variable.
- Assuming all local runtimes enforce max_tokens the same way. Set explicit limits and stop sequences.
- Running multiple runtimes on the same port.
- Omitting the Authorization header. Ollama may ignore it, but vLLM can reject requests when --api-key is enabled.
- Expecting heavily quantized local models to match hosted frontier models on reasoning-heavy tasks.
- Testing only the happy path. Add assertions for error responses and malformed outputs.
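For the last point, one common malformed output is a reply that should be JSON but arrives as prose or wrapped in a Markdown code fence. Below is a minimal sketch of a defensive parser; parse_json_reply is a hypothetical helper, and stripping the fence is one possible policy, not the only reasonable one.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_json_reply(text):
    """Parse a model reply as a JSON object; return None if it is not one."""
    cleaned = text.strip()
    # Some models wrap JSON in a Markdown code fence; drop the fence lines.
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        cleaned = "\n".join(lines)
    try:
        value = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


print(parse_json_reply('{"status": "ok"}'))
print(parse_json_reply("Sure! Here is your JSON."))
```

A None result is a signal for your test to fail or for your application to retry, depending on how strict the contract needs to be.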
Local vs hosted: cost and latency math
The table below compares local inference on an M3 Max with 128 GB unified memory against hosted equivalents. Time to first token is measured cold, with no batching, on a 1,024-token prompt.
| Model | Local TTFT | Local throughput | Hosted equivalent | Hosted price | Hosted TTFT |
|---|---|---|---|---|---|
| Llama 3.3 70B Q4_K_M | 1.2 s | 12 tok/s | GPT-5.5 Instant | $5 / $30 per 1M | 200 ms |
| DeepSeek V4 67B Q4_K_M | 1.4 s | 10 tok/s | DeepSeek-Chat hosted | $0.55 / $2.20 per 1M | 280 ms |
| Qwen 3.6 32B Q5_K_M | 0.7 s | 28 tok/s | Qwen-Max hosted | $1.60 / $6.40 per 1M | 240 ms |
| Gemma 4 27B Q4_K_M | 0.5 s | 35 tok/s | Gemini 3 Flash | $0.35 / $1.05 per 1M | 180 ms |
Hosted APIs usually win on latency. Local APIs win on privacy immediately and can win on cost once development or internal traffic becomes large enough.
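Time to first token is only part of the wait. For a full reply, the generation rate usually dominates, which a short calculation over the local columns of the table above makes visible. The 300-token reply length is an assumption for illustration, and the table gives no hosted throughput, so only local rows are computed.

```python
def response_seconds(ttft_s, tokens_per_s, output_tokens):
    """Approximate time to receive a full reply: time to first token plus generation."""
    return ttft_s + output_tokens / tokens_per_s


# Local rows from the table above: (TTFT in seconds, throughput in tok/s).
local_rows = {
    "Llama 3.3 70B Q4_K_M": (1.2, 12),
    "Qwen 3.6 32B Q5_K_M": (0.7, 28),
    "Gemma 4 27B Q4_K_M": (0.5, 35),
}
for model, (ttft, rate) in local_rows.items():
    print(f"{model}: about {response_seconds(ttft, rate, 300):.1f} s")
```

This is also why streaming matters more for local models: the user sees output after the TTFT instead of after the full generation time.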
A practical deployment pattern:
- Use local models during the inner development loop.
- Use hosted models in staging and production when latency matters.
- Keep both targets covered by the same Apidog scenario tests.
- Switch with environment variables, not code branches.
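One way to follow the last point is to read the whole target from environment variables with local defaults, so the calling code has no branch at all. The variable names LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL are assumptions for this sketch, not names defined by the OpenAI SDK or by any runtime.

```python
import os


def resolve_target(env=None):
    """Build client settings from environment variables, defaulting to local Ollama."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1").rstrip("/"),
        "api_key": env.get("LLM_API_KEY", "ollama"),
        "model": env.get("LLM_MODEL", "llama3.3:70b-instruct-q4_K_M"),
    }


# No variables set: the local defaults apply.
print(resolve_target({}))
# Hosted target: only the environment differs, the calling code stays the same.
print(resolve_target({"LLM_BASE_URL": "https://api.openai.com/v1/", "LLM_API_KEY": "sk-..."}))
```

The base_url and api_key values can then be passed to the OpenAI client constructor, and model to each request.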
For model-specific walkthroughs, see How to run DeepSeek V4 locally and the DeepSeek V4 usage guide.
Real-world use cases
Compliance-heavy development
A fintech compliance team can use Ollama on engineer laptops to draft suspicious activity report prototypes without sending account numbers or transaction patterns to a hosted provider. Production can still use a hosted model with a redacted prompt.
Apidog scenarios can assert that the redaction step runs before any request leaves the local environment.
Prompt engineering training
A game studio can run a local Qwen model for internal prompt training. Interns can test workflows offline without exposing unreleased game lore to a third-party endpoint.
The same application can later use Gemini 3 Flash in production by changing only the environment. For production wiring, see the Gemini 3 Flash API guide.
Private network inference
A healthcare startup can run vLLM on a GPU server inside a hospital network. The endpoint stays off public DNS, while developers still use the OpenAI SDK and the same contract tests they use locally.
Conclusion
Local LLM APIs are now straightforward to integrate because they can mimic the OpenAI API shape. The implementation path is simple:
- Pick Ollama for laptops, vLLM for shared GPU serving, or llama.cpp for tight hardware control.
- Start the OpenAI-compatible endpoint.
- Verify it with a minimal SDK request.
- Move base_url and api_key into environment variables.
- Build Apidog scenarios that run against both local and hosted environments.
Use Apidog to keep those contracts testable as you switch models and runtimes. If you have not picked a model yet, start with Best local LLMs 2026. For agent workflows, read How to test AI agents API.



