Best Local LLMs of 2026

This guide helps you choose a local LLM for 2026 based on VRAM, latency, and workload, then serve and test it through an OpenAI-compatible API using Ollama, vLLM, LM Studio, and Apidog.


TL;DR

  • The “best” local LLM in 2026 depends on your VRAM budget, latency target, and use case: coding, reasoning, multilingual, or vision.
  • For 24 GB GPUs, Qwen 3.6 32B and DeepSeek V4 Flash are the strongest all-rounders.
  • For 8 GB and below, Gemma 4 9B and Llama 5.1 8B are the practical picks.
  • For reasoning or coding-heavy workloads, use DeepSeek V4 Pro quantized or GLM 5.1.
  • Use Ollama or LM Studio to expose an OpenAI-compatible HTTP endpoint.
  • Test local models with Apidog the same way you test hosted models.
  • Use Apidog to mock, replay, and benchmark local model traffic without spending hosted LLM tokens.

If you are already focused on DeepSeek, see the DeepSeek V4 local install guide and the DeepSeek V4 overview.

Why local LLMs matter again in 2026

A few years ago, running a local LLM usually meant accepting lower quality. That is less true now.

Open-weight models have narrowed the quality gap with hosted GPT-4-class systems, especially for:

  • Extraction
  • Classification
  • Tool calling
  • Coding assistance
  • Reasoning workflows
  • Structured output generation

The bigger change is hardware. A 24 GB consumer GPU can run a 32B-parameter model at production-quality 4-bit quantization. A Mac Studio with 64 GB unified memory can run DeepSeek V4 Flash at usable speeds.
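
As a rough sanity check on that claim, here is a back-of-envelope estimate of weight memory at 4-bit quantization. The overhead factor is an assumption, not a measurement:

# Rough weight-memory estimate for a quantized model (illustrative assumptions).
def weight_memory_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.15) -> float:
    """Approximate GPU memory for model weights alone, excluding KV cache."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# A 32B model at 4 bits is roughly 16 GB of weights plus runtime overhead,
# which leaves headroom for context on a 24 GB card.
print(f"{weight_memory_gb(32, 4):.1f} GB")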

Local models now make sense when you care about:

  • Data residency
  • Vendor lock-in
  • Predictable inference cost
  • Offline or private workloads
  • Internal tools and CI workflows

The hard part is no longer only “is the model good enough?” It is also:

Can your app call the local model the same way it calls a hosted API?

That is why OpenAI-compatible serving and API testing tools matter.

Selection criteria

This shortlist is not just a leaderboard scrape. The criteria:

  • Open weights with a permissive license such as MIT, Apache 2.0, or a production-friendly community license
  • Active maintenance in 2026
  • OpenAI-compatible serving through Ollama, vLLM, or LM Studio
  • Strong real-world performance in at least one area:
    • General reasoning
    • Code
    • Multilingual output
    • Vision
    • Long context
    • Tool calling
  • Reasonable hardware requirements

The models were tested with the same prompt set on a 4090 and a Mac Studio M3 Ultra, then cross-checked against LMSYS Chatbot Arena and the Hugging Face Open LLM Leaderboard where applicable.

Local LLM picks for 2026

Model | Best for | Practical hardware target
DeepSeek V4 Pro | Reasoning-heavy agents | 192 GB unified memory or 2x 80 GB GPUs
DeepSeek V4 Flash | General local agent, coding, RAG | 24 GB VRAM at Q4
Qwen 3.6 32B | Multilingual, structured output, tool calling | 24 GB VRAM at Q4
GLM 5.1 | Tool-calling agents, extraction, JSON workflows | Local serving through Ollama or vLLM
Llama 5.1 8B | Smaller local setups | 8 GB-class hardware
Gemma 4 9B | Lightweight local assistants | 8 GB-class hardware

1. DeepSeek V4 Pro

DeepSeek V4 Pro is the flagship model in the DeepSeek V4 release. It is available as 4-bit GGUF and AWQ on Hugging Face.

The full model has:

  • 1.6T total parameters
  • 49B active parameters

That puts it in datacenter-class territory. Quantized to Q4, it fits on:

  • A pair of 80 GB H100s
  • A Mac Studio M3 Ultra with 192 GB unified memory

For most developers, V4 Pro is not the first model to run locally. It is more useful as a reference point for high-end reasoning quality.

If you would rather use the same family through a hosted API, see how to use the DeepSeek V4 API.

Best for: reasoning-heavy agents and high-end local inference.

Hardware: 192 GB unified memory or 2x 80 GB GPUs.

Where to get it: DeepSeek V4 Pro GGUF on Hugging Face.

2. DeepSeek V4 Flash

DeepSeek V4 Flash is the smaller V4 variant:

  • 284B total parameters
  • 13B active parameters
  • Fits in 24 GB VRAM at 4-bit quantization
  • Leaves room for a 64K context window

On a 4090, throughput averages about 28 tokens per second on long-form generation.


This is the DeepSeek model most teams are likely to run locally. In testing, reasoning quality stayed close to V4 Pro, while coding was slightly behind.

For an end-to-end setup, use the DeepSeek V4 local install guide.

Best for: general-purpose local agents, coding assistants, and RAG generation.

Hardware: 24 GB VRAM at Q4, or 16 GB at Q3 with quality loss.

Where to get it:

ollama pull deepseek-v4-flash

Or use the Hugging Face GGUF.

3. Qwen 3.6 32B

Alibaba’s Qwen models have been one of the most consistent open-weight model families.

Qwen 3.6 32B at Q4 fits in 24 GB VRAM and performs well on:

  • General reasoning
  • Tool calling
  • Structured outputs
  • Multilingual tasks

Its multilingual support is the main reason to choose it over many Western open models. It handles Chinese, Japanese, Korean, and Arabic at a high level.


If your product needs one local model for reasoning plus multilingual output, Qwen 3.6 32B is the most practical pick.

Best for: multilingual products, structured output, tool calling, and balanced cost.

Hardware: 24 GB VRAM at Q4.

Where to get it:

ollama pull qwen3.6:32b

Or use Qwen 3.6 on Hugging Face.

4. GLM 5.1

Zhipu AI’s GLM line has become a strong option for tool-calling and structured workflows.

GLM 5.1 scores near the top among open models on tool-calling benchmarks. Its strongest areas are:

  • Reasoning
  • Classification
  • Structured extraction
  • JSON-mode workflows
  • Instruction following

Coding is weaker than its reasoning and extraction performance.


Choose GLM 5.1 when your workload is mostly tool calls, agentic workflows, or JSON schema extraction.

Best for: tool-calling agents, structured extraction, and JSON-mode pipelines.
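
To see what that workload looks like against a local OpenAI-compatible endpoint, here is a minimal tool-calling sketch. The model tag and the lookup_invoice function are assumptions for illustration; use whatever tag you actually pulled, and confirm your server supports tool calls:

from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

# Hypothetical tool definition, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_invoice",
        "description": "Fetch an invoice by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-5.1",  # assumed tag; check `ollama list` for the real one
    messages=[{"role": "user", "content": "Pull up invoice INV-1042."}],
    tools=tools,
)

print(resp.choices[0].message.tool_calls)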

Serve a local LLM like a hosted API

Once the model is running, your application still needs an HTTP endpoint.

Three serving paths matter.

Option 1: Ollama

Ollama is the easiest path for local development.

Start the server:

ollama serve

Pull a model:

ollama pull qwen3.6:32b

Ollama exposes an OpenAI-compatible endpoint at:

http://localhost:11434/v1

That means most OpenAI SDK-based apps only need two changes:

  • base_url
  • model

Option 2: vLLM

vLLM is the production-oriented option.

Use it when you need:

  • Better throughput
  • Lower latency
  • Continuous batching
  • Higher concurrency

It exposes an OpenAI-compatible API at:

http://localhost:8000/v1

Option 3: LM Studio

LM Studio is useful for individual developers who want a GUI.

Enable the local server in settings, then point your app or API client at the exposed local endpoint (by default http://localhost:1234/v1).

Minimal Python client example

The OpenAI Python client can call Ollama, vLLM, or LM Studio if the server exposes an OpenAI-compatible API.

from openai import OpenAI

client = OpenAI(
    api_key="ollama",  # any string; Ollama ignores it
    base_url="http://localhost:11434/v1",
)

resp = client.chat.completions.create(
    model="qwen3.6:32b",
    messages=[
        {
            "role": "user",
            "content": "Summarize the differences between MoE and dense models in three bullets."
        }
    ],
    temperature=0.3,
)

print(resp.choices[0].message.content)

To switch models, change only the model name:

model="deepseek-v4-flash"

or:

model="llama5.1:8b"

The request shape stays the same.

For a related hosted/local workflow, see how to use DeepSeek V4 for free.

Test local models with Apidog

Local inference gives you control, but it also gives you more things to debug.


When a hosted provider breaks, you check the status page. When your local model breaks, you own the issue.

You need to inspect:

  • Raw requests
  • Headers
  • Streaming responses
  • Tool-call payloads
  • Token latency
  • Time to first token
  • Output differences between model versions

Apidog treats your Ollama or vLLM endpoint like any other API.

1. Save canonical requests

Create one request collection per model.

Include realistic values for:

  • Prompt
  • System message
  • Temperature
  • max_tokens
  • Tool definitions
  • JSON schema requirements

Replay the same request whenever you change models or quantization levels.

2. Diff outputs across models

Run the same prompt against:

  • Qwen 3.6
  • DeepSeek V4 Flash
  • GLM 5.1
  • Llama 5.1
  • Gemma 4

Then compare responses to spot regressions before shipping.
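
A minimal sketch of that loop, assuming Ollama-style tags like the ones used earlier in this guide (swap in whatever ollama list reports on your machine):

from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

# Assumed tags; replace with the models you actually pulled.
MODELS = ["qwen3.6:32b", "deepseek-v4-flash", "glm-5.1", "llama5.1:8b", "gemma4:9b"]
PROMPT = "Extract the invoice number and total from: 'Invoice INV-1042, total $318.40'. Reply as JSON."

for model in MODELS:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
    )
    # Write one file per model so the outputs are easy to diff.
    safe_name = model.replace(":", "_").replace(".", "_")
    with open(f"out_{safe_name}.txt", "w") as f:
        f.write(resp.choices[0].message.content or "")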

3. Mock the endpoint for CI

CI should not need a 24 GB GPU to pass.

Use Apidog mocks to return realistic JSON or streaming responses during tests. That keeps unit and integration tests deterministic even when the local model is offline.
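
One way to wire that up, as a sketch: read the base URL from an environment variable so CI points at the Apidog mock while local runs hit the real server. The variable name here is arbitrary:

import os
from openai import OpenAI

# LLM_BASE_URL is an arbitrary name for this sketch.
# In CI, set it to your Apidog mock URL; locally it falls back to Ollama.
client = OpenAI(
    api_key="unused",
    base_url=os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
)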

4. Benchmark throughput

Track:

  • Latency
  • Time to first token
  • Tokens per second
  • Failure rate
  • Response size

Use those numbers to compare Q4 vs Q5 quantization or Ollama vs vLLM.
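
Here is a rough way to measure time to first token and throughput over the streaming API. Token counts are approximated by chunk counts, so treat the numbers as relative, not absolute:

import time
from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="qwen3.6:32b",  # assumed tag; use whatever you pulled
    messages=[{"role": "user", "content": "Explain continuous batching in two paragraphs."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1

total = time.perf_counter() - start
if first_token_at:
    ttft = first_token_at - start
    print(f"time to first token: {ttft:.2f}s")
    print(f"approx tokens/sec:   {chunks / max(total - ttft, 1e-6):.1f}")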

5. Document the local API

Apidog projects can export OpenAPI 3.1, so teammates get a clear contract for calling your internal local model endpoint.

For a broader API workflow, see Apidog as a Postman alternative.

Common mistakes when running local LLMs

Picking the biggest model that fits

A 32B model at Q3 can be worse than a 14B model at Q5.

Once you go below 4-bit quantization, quality can drop quickly. Do not compare parameter count without comparing quantization quality.

Forgetting that context length uses VRAM

A long context window is not free.

A 32K-token context on a 32B model needs several GB of KV cache. Reserve memory for context before choosing the model.
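
A quick estimate of why, using assumed architecture numbers for a generic 32B-class model with grouped-query attention (these are not the specs of any model listed above):

# Back-of-envelope KV-cache size: 2 (K and V) * layers * kv_heads * head_dim
# * context_length * bytes_per_value. All architecture numbers are assumptions.
layers, kv_heads, head_dim = 64, 8, 128
context_len, bytes_per_value = 32_768, 2  # fp16 cache

kv_cache_gb = 2 * layers * kv_heads * head_dim * context_len * bytes_per_value / 1e9
print(f"~{kv_cache_gb:.1f} GB of KV cache")  # roughly 8-9 GB at these settings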

Trusting random fine-tunes

Avoid random Hugging Face uploads for production workloads.

Prefer:

  • Original model cards
  • Known fine-tune authors
  • Reproducible evaluation results
  • Clear licenses

A poisoned or poorly trained fine-tune can create security and reliability issues.

Skipping the mock layer

Local models go down.

Common causes:

  • Driver crashes
  • OOM kills
  • GPU throttling
  • Process restarts
  • Broken model downloads

If CI calls the real local model directly, your tests become flaky. Mock the endpoint in Apidog instead.

Ignoring tool-call format differences

Different models can support tool calls but emit slightly different JSON shapes.

Test each model before swapping it into production.

Pay attention to:

  • Function name fields
  • Argument serialization
  • Streaming chunks
  • Invalid JSON recovery
  • Empty tool-call responses
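
One defensive pattern that covers the last two items, as a sketch; the helper is illustrative, not part of any specific framework:

import json

def parse_tool_arguments(message) -> list[dict]:
    """Return name/arguments pairs from an OpenAI-style message, tolerating bad JSON."""
    calls = []
    for call in (message.tool_calls or []):  # some models return None instead of a list
        try:
            args = json.loads(call.function.arguments or "{}")
        except json.JSONDecodeError:
            # Invalid JSON recovery: keep the raw string so the caller can retry or log it.
            args = {"_raw": call.function.arguments}
        calls.append({"name": call.function.name, "arguments": args})
    return calls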

Real-world usage patterns

A startup running a customer-support agent moved from GPT-5.5 to Qwen 3.6 32B on a single 4090. Latency stayed under 800 ms, monthly inference cost dropped, and the team uses Apidog mocks to keep CI deterministic.

A solo developer building a voice assistant runs Gemma 4 9B on an M2 Pro with 16 GB unified memory. Multi-token prediction drafters provide enough throughput for a native-feeling assistant.

A fintech research team runs DeepSeek V4 Flash on two 4090s for nightly batch summarization of regulatory filings. Their cost per summary is mostly electricity and maintenance time.

Implementation checklist

Use this flow to get from model choice to testable local API.

1. Pick the model

For 24 GB VRAM:

Qwen 3.6 32B
DeepSeek V4 Flash

For smaller machines:

Llama 5.1 8B
Gemma 4 9B

For tool-heavy workflows:

GLM 5.1
Qwen 3.6 32B

For high-end reasoning:

DeepSeek V4 Pro
DeepSeek V4 Flash

2. Pull the model

Example with Ollama:

ollama pull qwen3.6:32b

3. Start the local server

ollama serve

4. Test the endpoint

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6:32b",
    "messages": [
      {
        "role": "user",
        "content": "Return three API testing best practices as JSON."
      }
    ],
    "temperature": 0.2
  }'

5. Add the endpoint to Apidog

Use:

http://localhost:11434/v1

Then create saved requests for:

  • Normal chat
  • Streaming chat
  • Tool calling
  • JSON output
  • Long-context prompts
  • Failure cases

6. Replay before every model swap

Before changing from one model to another, replay the same collection and compare:

  • Output structure
  • Latency
  • Tool-call behavior
  • JSON validity
  • Error handling

Conclusion

The best local LLM in 2026 is the one that fits your VRAM, latency budget, and quality bar.

Most teams should start with:

  • Qwen 3.6 32B or DeepSeek V4 Flash for 24 GB GPUs
  • Llama 5.1 8B or Gemma 4 9B for smaller hardware
  • GLM 5.1 when tool calling and structured extraction are the main workload
  • DeepSeek V4 Pro only when you have high-end hardware and need maximum reasoning quality

Five practical takeaways:

  • Local model quality is close enough for many production tasks.
  • Ollama plus an OpenAI-compatible client is the fastest setup path.
  • Quantization quality matters more than raw parameter count.
  • Treat the local model as a production API.
  • Use Apidog to save requests, mock CI, benchmark runs, and document the endpoint.

Next step: pick a model, run ollama pull <name>, and point Apidog at:

http://localhost:11434/v1

You can start replaying and benchmarking requests within an hour.

FAQ

What is the best local LLM for a 24 GB GPU in 2026?

For most workloads, use Qwen 3.6 32B at Q4 or DeepSeek V4 Flash at Q4.

Pick Qwen for multilingual or tool-heavy tasks. Pick DeepSeek V4 Flash for reasoning and coding. See the DeepSeek V4 local guide for setup details.

Can I run a local LLM on a Mac?

Yes. Apple silicon with 16 GB or more unified memory can run Llama 5.1 8B and Gemma 4 9B comfortably.

An M3 Ultra with 192 GB unified memory can run DeepSeek V4 Pro at Q4. Use Ollama or LM Studio.

How do I test a local LLM the same way I test OpenAI?

Point your OpenAI-compatible client and your Apidog project at the local serving URL.

Ollama:

http://localhost:11434/v1

vLLM:

http://localhost:8000/v1

The request shape stays the same. Only the base URL and model name change.

Is local LLM quality really at parity with hosted?

For reasoning, coding, classification, extraction, and tool calling, top open models are often within single-digit percentage points of hosted models.

Hosted models still tend to lead on vision, long-context document QA, and creative writing.

What about cost?

A 4090 can run DeepSeek V4 Flash for the price of electricity and hardware maintenance.

At high volume, hosted inference can cost hundreds or thousands per month. The break-even point depends on utilization, but it is often around millions of tokens per month.
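
As a purely illustrative calculation, where every number is an assumption (power draw, electricity price, and hosted rates all vary):

# Illustrative only: all prices and power figures below are assumptions.
gpu_watts, hours_per_day, price_per_kwh = 350, 8, 0.15
hosted_price_per_million_tokens = 1.00  # assumed hosted rate in USD

monthly_electricity = gpu_watts / 1000 * hours_per_day * 30 * price_per_kwh
breakeven_tokens_millions = monthly_electricity / hosted_price_per_million_tokens

print(f"electricity: ${monthly_electricity:.2f}/month")
print(f"break-even:  ~{breakeven_tokens_millions:.0f}M tokens/month (excluding hardware cost)")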

How do I switch a production app between hosted and local?

Keep the OpenAI-compatible client.

Change:

base_url
model

Then replay saved API requests before shipping the swap. See API testing without Postman for the same testing pattern.

Where can I track current model rankings?

Use both:

  • LMSYS Chatbot Arena
  • Hugging Face Open LLM Leaderboard

Cross-reference them because they measure different things.
