Vijaya Rajeev Bollu

Posted on Jun 5

I Benchmarked 3 Local LLMs on My Laptop — Here's What the Numbers Actually Show

#python #ai #ollama #machinelearning

The Problem With Choosing a Local Model

Everyone has an opinion on which local LLM is best.

"Use Llama — it's the most popular." "Mistral 7B has the best quality." "Phi-3 Mini is small and efficient."

None of these claims come with numbers. Specifically: your numbers, on your hardware, for your workload.

I built a benchmarking system to change that. Three models, 30 prompts, full latency distribution, memory profiling per inference call, and a JSON validation layer to measure structured output reliability.

Here's what I found — and why the results matter for anyone deploying local models in production.

The Setup

Three models tested:

llama3.2:3b — 3B parameters, Q4_K_M quantization, 2 GB download
phi3:mini — 3.8B parameters, Q4_K_M, 2.3 GB download
mistral:7b — 7B parameters, Q4_K_M, 4.1 GB download

Hardware: CPU only, no GPU acceleration. This is the worst-case baseline — the scenario that exposes real latency and memory numbers.

30 test prompts across 5 categories:

Short factual (10): "What is the capital of France?"
Reasoning (8): "Explain why the sky appears blue."
Code generation (5): "Write a Python function to reverse a string."
Structured output (5): "List 3 frameworks in JSON format with name and use_case."
Multi-step (2): Complex chained reasoning tasks.

Architecture

POST /query
  → Pydantic validation → Ollama HTTP API → JSON Validator → QueryResponse

POST /benchmark
  → Load test_prompts.json
  → For each prompt: psutil memory before → Ollama → psutil memory after
  → NumPy: P50/P95/P99 latency, avg TPS, peak/avg memory
  → BenchmarkResult JSON

The benchmark runs prompts sequentially, not in parallel. Parallel would contaminate the per-prompt memory measurements.
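A minimal sketch of the sequential loop described above. The names `Sample`, `run_sequential`, and `system_memory_mb` are hypothetical, and reading system-wide used memory through `psutil.virtual_memory()` is an assumption about how the readings are taken (the project may track a process RSS instead). The model call is passed in as a function so the loop can be exercised without a running Ollama server.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    prompt: str
    latency_ms: float
    mem_before_mb: float
    mem_after_mb: float


def system_memory_mb() -> float:
    """System-wide memory in use, in MB (assumed metric; psutil imported lazily)."""
    import psutil  # third-party: pip install psutil
    return psutil.virtual_memory().used / (1024 * 1024)


def run_sequential(
    prompts: List[str],
    generate: Callable[[str], str],
    memory_probe: Callable[[], float] = system_memory_mb,
) -> List[Sample]:
    """Run prompts one at a time so each memory reading maps to exactly one prompt."""
    samples = []
    for prompt in prompts:
        mem_before = memory_probe()
        start = time.perf_counter()
        generate(prompt)  # blocks until the full response has arrived
        latency_ms = (time.perf_counter() - start) * 1000
        samples.append(Sample(prompt, latency_ms, mem_before, memory_probe()))
    return samples


# Offline demo: a stub stands in for the real Ollama call.
def fake_generate(prompt: str) -> str:
    time.sleep(0.01)
    return "stub response"


for sample in run_sequential(["What is the capital of France?"], fake_generate, lambda: 0.0):
    print(sample)
```

Because nothing else runs between the two probe calls, the before/after pair belongs to a single prompt. With concurrent requests that pairing no longer holds.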

Results

Llama 3.2 3B (Q4_K_M)

avg_tokens_per_second: 42.3
p50_latency_ms: 1203
p95_latency_ms: 3847
p99_latency_ms: 5120
peak_memory_mb: 6953
avg_memory_mb: 6842
total_test_duration_s: 87.4

Interpretation: P50 at 1.2 seconds is excellent. P95 at 3.8 seconds misses a 3-second SLA; the outliers are multi-step tasks and longer code generation. Memory is stable: Ollama keeps the model loaded between requests, so the weights are read once and stay resident. The gap between peak and average is only 111 MB.
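For reference, one way to arrive at a tokens-per-second figure is from the timing fields Ollama returns with a non-streaming `/api/generate` response: `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating). Whether this project computes `avg_tokens_per_second` this way is an assumption, and the response values below are invented for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Generation throughput for one non-streaming /api/generate response."""
    eval_count = response["eval_count"]            # number of tokens generated
    eval_duration_ns = response["eval_duration"]   # generation time in nanoseconds
    return eval_count / (eval_duration_ns / 1e9)


# Invented values: 120 tokens generated in 3 seconds.
example_response = {"eval_count": 120, "eval_duration": 3_000_000_000}
print(tokens_per_second(example_response))
```

Note that `eval_duration` excludes model load and prompt evaluation, so throughput computed this way will not convert directly into end-to-end latency.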

Phi-3 Mini (Q4_K_M)

avg_tokens_per_second: 4.7
p50_latency_ms: 29554
p95_latency_ms: 34127

Interpretation: 4.7 tok/s on CPU. The median request takes about 29.5 seconds. This benchmark doesn't establish the cause: a 3.8B model running roughly 6x slower than the 7B Mistral at the same quantization is surprising, so it may say more about how this particular build runs on this CPU than about the model itself. With a GPU, these numbers would likely look very different. On this CPU: not usable for interactive applications.

Mistral 7B (Q4_K_M)

avg_tokens_per_second: 28.1
p50_latency_ms: 2301
p95_latency_ms: 5912
peak_memory_mb: 14413

Interpretation: Best output quality (judged manually), highest memory. A 14 GB peak, which includes OS and Ollama overhead, leaves no realistic room on an 8 GB machine and little on 16 GB. P95 is 5.9 seconds. It is slower than Llama 3.2 3B at both P50 and P95, as expected for a 7B model on CPU.

The JSON Validation Layer

One of the project's core features: send a JSON schema with your query, get validated structured output back.

POST /query
{
  "prompt": "List 3 programming languages",
  "json_schema": {
    "type": "object",
    "properties": {
      "languages": {"type": "array", "items": {"type": "string"}}
    }
  }
}
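A sketch of sending that request from Python using only the standard library. The base URL `http://localhost:8000` is an assumption taken from the `uvicorn` command in the Try It section, and `build_query_request` is a hypothetical helper name. The shape of the response body is not specified here, so the example stops at building the request.

```python
import json
import urllib.request


def build_query_request(prompt: str, json_schema: dict,
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST /query request with a JSON body."""
    body = json.dumps({"prompt": prompt, "json_schema": json_schema}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


schema = {
    "type": "object",
    "properties": {"languages": {"type": "array", "items": {"type": "string"}}},
}
request = build_query_request("List 3 programming languages", schema)
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=120) as resp:
#     print(json.load(resp))
```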

Without retry: 68% of responses matched the schema on the first attempt.

With retry + error injection:

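import json

def build_retry_prompt(original_prompt: str, validation_error: str, schema: dict) -> str:
    """Re-ask the original question and tell the model exactly what failed."""
    return (
        f"{original_prompt}\n\n"
        "Your previous response was invalid JSON. "
        f"Error: {validation_error}. "
        f"Please respond with valid JSON matching this schema: {json.dumps(schema)}"
    )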

With retry: 94% success rate across all three models.

The error injection is what matters. Telling the model exactly what went wrong is significantly more effective than "try again."
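A minimal sketch of the validate-then-retry loop, under a few assumptions: `check_schema` is a tiny stand-in that handles only `type`, `properties`, and `items` (a real service would more likely use a full JSON Schema validator), `max_retries=2` is an arbitrary choice, and the model call is passed in as `generate` so the loop runs without Ollama.

```python
import json
from typing import Callable, Optional, Tuple


def check_schema(value, schema: dict, path: str = "$") -> Optional[str]:
    """Return an error message for the first mismatch, or None if value conforms.
    Supports only a small subset of JSON Schema: type, properties, items."""
    types = {"object": dict, "array": list, "string": str}
    expected = schema.get("type")
    if expected in types and not isinstance(value, types[expected]):
        return f"{path}: expected {expected}, got {type(value).__name__}"
    if isinstance(value, dict):
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                error = check_schema(value[key], sub_schema, f"{path}.{key}")
                if error:
                    return error
    if isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            error = check_schema(item, schema["items"], f"{path}[{i}]")
            if error:
                return error
    return None


def query_with_retry(prompt: str, schema: dict, generate: Callable[[str], str],
                     max_retries: int = 2) -> Tuple[object, int]:
    """Return (parsed_json, attempts_used); raise ValueError if every attempt fails."""
    current_prompt = prompt
    error = "no attempts made"
    for attempt in range(1, max_retries + 2):
        raw = generate(current_prompt)
        try:
            parsed = json.loads(raw)
            error = check_schema(parsed, schema)
        except json.JSONDecodeError as exc:
            error = f"not valid JSON: {exc.msg}"
        if error is None:
            return parsed, attempt
        # Feed the specific validation error back to the model.
        current_prompt = (
            f"{prompt}\n\n"
            "Your previous response was invalid JSON. "
            f"Error: {error}. "
            f"Please respond with valid JSON matching this schema: {json.dumps(schema)}"
        )
    raise ValueError(f"no valid response after {max_retries + 1} attempts: {error}")
```

The retry prompt is always rebuilt from the original prompt plus the latest error, so failed attempts do not pile up in the context.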

What I Learned

1. P95 is the production number, not the average.

Average latency for Llama 3.2 3B is ~1.4 seconds. P95 is 3.8 seconds. If you set a 3-second SLA based on the average, at least 5% of requests will miss it. That's at least 1 in 20 requests timing out. Measure the distribution, not the center.
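To make the gap between the average and the tail concrete, here is a small sketch with invented latencies (not the benchmark's raw data). The `percentile` helper uses linear interpolation, which is the default method of `np.percentile`; it is written in plain Python only so the example has no dependencies.

```python
import math
from statistics import mean


def percentile(values, q):
    """Linear-interpolation percentile, the same rule np.percentile uses by default."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def sla_miss_rate(latencies_ms, sla_ms):
    """Fraction of requests slower than the SLA."""
    return sum(1 for x in latencies_ms if x > sla_ms) / len(latencies_ms)


# Invented distribution: mostly fast responses with a slow tail.
latencies_ms = [900] * 10 + [1200] * 12 + [2000] * 5 + [3900, 4200, 5100]

print("mean:", round(mean(latencies_ms)))
print("p50:", percentile(latencies_ms, 50))
print("p95:", round(percentile(latencies_ms, 95)))
print("share of requests over a 3000 ms SLA:", sla_miss_rate(latencies_ms, 3000))
```

In this made-up sample the mean sits comfortably under the SLA while a tenth of requests exceed it, which is the pattern the average hides.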

2. Phi-3 Mini's model card doesn't predict its CPU performance.

The model card advertises strong benchmark scores. Those scores measure output quality, not inference speed on your hardware. On CPU-only inference here, 4.7 tok/s makes it unusable for interactive applications. Always benchmark on your actual hardware.

3. Memory delta tells you more than peak.

The peak figure includes OS overhead and Ollama itself. The delta between pre-inference and post-inference memory shows how much each request adds on top of the loaded model, mostly KV cache and scratch buffers. For Llama 3.2 3B, peak sat only ~111 MB above average, so per-request growth was small and stable across prompt types.
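A sketch of how peak, average, and per-request delta could be derived from paired before/after readings. The readings are invented, `memory_summary` is a hypothetical helper, and taking peak and average over the post-inference readings is an assumption about how the report is built.

```python
def memory_summary(readings):
    """readings: list of (before_mb, after_mb) pairs, one pair per request."""
    afters = [after for _, after in readings]
    deltas = [after - before for before, after in readings]
    return {
        "peak_memory_mb": max(afters),
        "avg_memory_mb": sum(afters) / len(afters),
        "max_delta_mb": max(deltas),
        "avg_delta_mb": sum(deltas) / len(deltas),
    }


# Invented readings for three sequential requests.
readings = [(5000.0, 5040.0), (5040.0, 5050.0), (5050.0, 5180.0)]
print(memory_summary(readings))
```

Peak and average move with whatever else the machine is doing, while the deltas isolate what each request added.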

4. Q4_K_M is the right default.

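All three models here are Q4_K_M builds, a common default in Ollama. Q4_K_M is one of llama.cpp's k-quants: weights are quantized to about 4 bits in blocks with their own scales, and the "M" variant keeps some tensors at higher precision, which recovers some quality compared to naive Q4_0. For factual and code tasks, quality degradation from FP16 to Q4_K_M is minimal. For complex reasoning tasks, there's a measurable drop, but at roughly 3 to 4x the memory, FP16 isn't practical on consumer hardware anyway.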
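The memory trade-off is simple arithmetic: weights take roughly parameters times bits per weight, divided by 8. The sketch below uses the nominal parameter counts from the setup and assumes about 4.5 effective bits per weight for a 4-bit format. Real Q4_K_M files come out somewhat larger because some tensors are stored at higher precision, and runtime memory adds KV cache and buffers on top.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


ASSUMED_4BIT_BPW = 4.5  # assumption: effective bits per weight, including scales

for name, params in [("llama3.2:3b", 3.0), ("phi3:mini", 3.8), ("mistral:7b", 7.0)]:
    fp16 = weight_footprint_gb(params, 16)
    q4 = weight_footprint_gb(params, ASSUMED_4BIT_BPW)
    print(f"{name}: FP16 ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB, ratio {fp16 / q4:.1f}x")
```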

5. Sequential benchmarking is the only way to attribute memory per prompt.

I tried parallelizing the benchmark for speed. The memory numbers became meaningless — Ollama's memory usage overlapped across concurrent requests and couldn't be attributed per-prompt. Sequential is slower but gives clean, attributable measurements.

Limitations

No GPU measurements. All results are CPU-only. Phi-3 Mini's poor CPU performance might not carry over to GPU at all. If you have a GPU, run your own benchmark.

Single hardware configuration. Results are from one machine. RAM speed, CPU generation, and available cores all affect inference speed. These numbers are directional, not universal.

Quality scoring is manual. The benchmark measures latency and throughput automatically. Output quality is subjective and not automated here — it requires a golden dataset and an LLM judge (a separate project).

30 prompts is not statistically robust. With 30 samples, P99 is effectively determined by the slowest one or two requests. A production benchmark should run 200+ prompts to get stable percentile estimates.
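A quick way to see why P99 is fragile at this sample size: under linear interpolation (NumPy's default; whether the project overrides it is not stated), the percentile lands at position (n - 1) * q / 100 in the sorted samples. The helper name `tail_support` is hypothetical.

```python
import math


def tail_support(n: int, q: float):
    """Locate percentile q in n sorted samples under linear interpolation.
    Returns (lower_index, upper_index, weight_on_upper), zero-based."""
    pos = (n - 1) * q / 100
    lower = math.floor(pos)
    upper = min(lower + 1, n - 1)
    return lower, upper, pos - lower


for n in (30, 200, 1000):
    lower, upper, weight = tail_support(n, 99)
    beyond = n - 1 - upper  # samples strictly slower than the upper neighbor
    print(f"n={n}: P99 sits between sorted samples {lower} and {upper} "
          f"(weight {weight:.2f} on the upper), {beyond} sample(s) beyond")
```

With 30 samples the two neighbors are the second-slowest and slowest requests, so a single unlucky prompt moves P99 directly.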

Try It

GitHub: [https://github.com/ThinkWithOps/02-local-ai-assistant]
YouTube: [https://youtu.be/SMI-eIn-tuw]

git clone https://github.com/ThinkWithOps/02-local-ai-assistant.git
cd 02-local-ai-assistant
bash scripts/install_models.sh  # pulls llama3.2:3b, phi3:mini, mistral:7b
pip install -r requirements.txt
uvicorn src.main:app --host 0.0.0.0 --port 8000

# Benchmark llama3.2:3b
python cli/main.py benchmark --model llama3.2:3b

# Compare all 3 models
python cli/main.py compare
# Generates: reports/model_comparison_YYYYMMDD.md

Which local model are you running, and what's your P95 latency? Drop it in the comments.

Top comments (1)

Vic Chen • Jun 5

Really solid benchmark. I especially liked that you centered P95 instead of averages and separated peak from steady-state memory. The 42.3 tok/s and 1.2s P50 for Llama 3.2 3B versus 4.7 tok/s for Phi-3 Mini on CPU is exactly the kind of hardware-specific reality check people skip when they quote model cards. I’d be curious to see a follow-up slice on first-token latency versus full-response latency, because for product UX that split often matters as much as total throughput.