Juan Torchia

Posted on • Originally published at juanchi.dev

The Local LLM Ecosystem Doesn't Need Ollama (And That Made Me Uncomfortable)

Ollama just added native tool support for more models and the community is hyped. I get it — I've been using it, I defend it on Twitter when someone complains, and I've had it running on my machine for over a year. But I have something to say that probably isn't what you'd expect from someone who built their first local MCP server with Ollama as the backend.

Last week I tried to cut it out of the equation. Completely. And what I found made me uncomfortable enough to write this.

Local LLM Without Ollama: What Happens When You Go Straight to llama.cpp

The context: I was building a pipeline that needs to run inference from a Docker worker — no interface, no OpenAI-compatible API, nothing that isn't strictly necessary. One model, one input, one output. That's it.

Ollama in that context feels like driving a semi truck to pick up a loaf of bread. It comes with an HTTP server, model management, caching, a REST API, logs, updates… all of that has a cost. Not dramatic, but real.

So I tried the obvious alternative: raw llama.cpp, with a minimal Python wrapper.

# Install llama-cpp-python with CUDA support (if you have a GPU)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

# Or without GPU, CPU only
pip install llama-cpp-python
from llama_cpp import Llama

# Load the model directly — no server, no intermediate magic
llm = Llama(
    model_path="./models/mistral-7b-instruct-q4_k_m.gguf",
    n_ctx=4096,          # max context
    n_threads=8,         # CPU threads
    n_gpu_layers=35,     # layers on GPU (0 if you don't have one)
    verbose=False        # silence the llama.cpp noise
)

def run_inference(prompt: str) -> str:
    # Direct call — no HTTP, no JSON, no network overhead
    result = llm(
        prompt,
        max_tokens=512,
        temperature=0.7,
        stop=["</s>", "[INST]"],  # stop tokens for Mistral
        echo=False
    )
    return result["choices"][0]["text"].strip()

# Let's test it
response = run_inference("[INST] Explain what a composite index is in PostgreSQL [/INST]")
print(response)

That's it. No server. No port 11434. No ollama pull. The model is a .gguf file you grab from Hugging Face and point to directly.

The Numbers I Wasn't Expecting

On my machine (Ryzen 7, 32GB RAM, RTX 3060 12GB):

| Setup | Time to first token | Extra memory overhead |
|---|---|---|
| Ollama + model loaded | ~180ms | ~120MB |
| llama-cpp-python direct | ~95ms | ~0MB |
| Ollama cold start | ~3.2s | n/a |
| llama-cpp-python cold start | ~1.8s | n/a |

These aren't differences that'll change your life in interactive use. But in a pipeline running 500 inferences per hour, they start to matter.
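A quick back-of-envelope check on that claim, using only the time-to-first-token delta from the table above:

```python
# Per-call time-to-first-token saved, from the table: ~180ms vs ~95ms
saved_per_call_ms = 180 - 95   # ~85ms per inference
calls_per_hour = 500

# Total latency shaved off per hour of pipeline work
saved_per_hour_s = saved_per_call_ms * calls_per_hour / 1000
print(f"{saved_per_hour_s:.1f} seconds saved per hour")  # 42.5 seconds saved per hour
```

Not dramatic per call, but it compounds, and in a queue-driven worker that latency is pure dead time.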

Where Raw llama.cpp Falls Short and Where Ollama Actually Shines

Here's the uncomfortable part. After two days with the minimalist setup, I started missing specific things about Ollama. Not the server. Not the API. Concrete things:

1. Model management. ollama pull llama3.2 is one line. With llama-cpp-python you have to go to Hugging Face, find the right GGUF for your VRAM, download it manually, and hope the format is compatible. It's not complicated, but it's friction.

2. Automatic prompt compatibility. Ollama knows the chat template for each model. With raw llama.cpp, you have to format the prompt yourself:

# With Ollama — this works for any model
# ollama.chat(model="mistral", messages=[{"role": "user", "content": "hey"}])

# With llama-cpp-python — you need to know each model's format
def mistral_format(message: str) -> str:
    return f"[INST] {message} [/INST]"

def llama3_format(message: str) -> str:
    # Llama 3 has a completely different format
    return f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

def qwen_format(message: str) -> str:
    # And Qwen is yet another one
    return f"<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n"

# You can use the built-in chat handler in llama-cpp-python
# but it requires additional per-model configuration

This seems minor until you want to swap models and your pipeline breaks because you forgot to update the template.
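One way to contain that failure mode is a tiny template registry that fails loudly on unknown models. A minimal sketch of the idea (nothing like Ollama's real template system, which ships a template with each model):

```python
# Map model family -> prompt formatter, so swapping models means
# editing one dict entry instead of hunting through the pipeline
CHAT_TEMPLATES = {
    "mistral": lambda m: f"[INST] {m} [/INST]",
    "llama3": lambda m: (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{m}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    ),
    "qwen": lambda m: f"<|im_start|>user\n{m}<|im_end|>\n<|im_start|>assistant\n",
}

def format_prompt(model_family: str, message: str) -> str:
    try:
        return CHAT_TEMPLATES[model_family](message)
    except KeyError:
        # Blow up at call time instead of producing a silently wrong prompt
        raise ValueError(f"No chat template registered for {model_family!r}")

print(format_prompt("mistral", "hey"))  # [INST] hey [/INST]
```

The `ValueError` is the whole point: a wrong template doesn't crash anything, it just quietly degrades output quality, which is much worse.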

3. The OpenAI-compatible API. If you're plugging your local LLM into an agent, LangChain, an MCP server, or anything that speaks OpenAI — Ollama gives you that for free. With llama-cpp-python you have to spin up the server yourself:

# llama-cpp-python DOES have an OpenAI-compatible server
# but you have to launch it explicitly
python -m llama_cpp.server --model ./models/mistral-7b.gguf --port 8000

# Or programmatically
from llama_cpp.server.app import create_app
from llama_cpp.server.settings import ModelSettings, ServerSettings

# This is what Ollama does for you, with better DX
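Once that server is up, any OpenAI-style client just POSTs the standard chat-completions body to it. A minimal sketch of that payload (the model name here is a placeholder; the server maps it to whatever GGUF it loaded):

```python
import json

# Request body for POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "mistral-7b",  # placeholder name
    "messages": [{"role": "user", "content": "hey"}],
    "max_tokens": 512,
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)
```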

At some point, if you need a compatible API and multi-model support, you're just rebuilding Ollama. Which reminds me of something I wrote about not reimplementing what already exists when you're building agents.

The Mistakes I Made and the Gotchas Nobody Warns You About

Gotcha #1: the model in memory doesn't free itself.

With Ollama, models get unloaded from memory after a configurable timeout. With raw llama-cpp-python, the model lives as long as your Python object does. In a long-running pipeline, this matters:

import gc
from llama_cpp import Llama

class ModelManager:
    def __init__(self, model_path: str):
        self.path = model_path
        self._model = None

    def load(self):
        if self._model is None:
            self._model = Llama(model_path=self.path, n_gpu_layers=35)

    def unload(self):
        """Explicitly free VRAM — Ollama does this automatically"""
        if self._model is not None:
            del self._model
            self._model = None
            gc.collect()  # force garbage collection

    def infer(self, prompt: str) -> str:
        self.load()
        return self._model(prompt, max_tokens=512)["choices"][0]["text"]

Gotcha #2: context size and VRAM.

Ollama handles this with sensible defaults. With llama-cpp-python, if you set n_ctx=8192 and the model plus context doesn't fit in VRAM, the process either dies silently or llama.cpp offloads to CPU without telling you clearly. Always verify:

# Check whether the model loaded on GPU or fell back to CPU
llm = Llama(model_path="./model.gguf", n_gpu_layers=35, verbose=True)
# Look in the logs for: "llm_load_tensors: offloaded X/Y layers to GPU"
# If Y < n_gpu_layers, something didn't fit in VRAM
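If you'd rather not eyeball logs, you can capture and parse that line programmatically. A hypothetical helper, for illustration only; the exact log text can vary between llama.cpp versions, so treat the regex as an assumption:

```python
import re

def parse_gpu_offload(log_line: str) -> tuple[int, int]:
    """Extract (offloaded, total) layer counts from a llama.cpp load log line."""
    match = re.search(r"offloaded (\d+)/(\d+) layers to GPU", log_line)
    if not match:
        raise ValueError("no offload info found in log line")
    return int(match.group(1)), int(match.group(2))

offloaded, total = parse_gpu_offload(
    "llm_load_tensors: offloaded 28/33 layers to GPU"
)
if offloaded < total:
    # Some layers fell back to CPU: expect much slower inference
    print(f"warning: only {offloaded}/{total} layers fit in VRAM")
```

Wiring this into a startup health check turns a silent performance cliff into a visible alert.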

Gotcha #3: the Docker image.

Ollama has an official image. With llama-cpp-python you have to build your own, and if you need CUDA, the base image easily hits 6GB before you add anything. I learned this the hard way when I was optimizing Docker images — same principle applies here: multi-stage, only what you need:

# CUDA devel base image (note the full tag, e.g. 12.1.0) -- this already weighs a ton
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS builder

RUN apt-get update && apt-get install -y python3-pip git cmake

# Compile llama-cpp-python with CUDA from source
# (the flag name has changed across releases: LLAMA_CUBLAS -> LLAMA_CUDA -> GGML_CUDA;
# check which one your pinned llama-cpp-python version expects)
ENV CMAKE_ARGS="-DLLAMA_CUDA=on"
RUN pip install llama-cpp-python --no-cache-dir

# Final image: runtime only, plus the Python interpreter the packages need
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages

# Add your code, not the compiler
COPY ./src /app
WORKDIR /app

FAQ: Local LLM Without Ollama — Questions I Got This Week

Is it worth replacing Ollama with raw llama.cpp?
Depends on the context. For development, exploration, or anything that needs to swap models frequently: no. Ollama wins on DX by a mile. For production pipelines where the model is fixed, latency matters, and you don't need an API: yes, it makes sense to evaluate llama-cpp-python or even the raw llama.cpp binary.

How much faster is llama.cpp without the Ollama overhead?
In my tests, between 10% and 40% depending on the model and hardware. The biggest difference is in cold start (~45% faster) and local network overhead. In tokens per second with the model already loaded, the difference is much smaller — the bottleneck is the inference itself.

Does llama-cpp-python support the same models as Ollama?
Every GGUF model that works in llama.cpp works in llama-cpp-python. Which is basically everything — Llama, Mistral, Qwen, Phi, Gemma, DeepSeek. The difference is that with Ollama you do ollama pull name and with llama.cpp you have to download the .gguf manually from Hugging Face or use huggingface_hub.

What about tool calls / function calling without Ollama?
This is where Ollama still has the edge. Tool calls require the model to support the right format AND the runtime to handle it properly. llama-cpp-python has basic support via grammar-based sampling, but it's more manual. If your pipeline depends on function calling, Ollama (or LM Studio for desktop) is still more comfortable. This is exactly what I missed most when I was testing integrations for automating repetitive workflows.

Does it make sense to use both in the same project?
Absolutely. Ollama for local development and experimentation, llama-cpp-python direct for the production worker with a fixed model. They're not mutually exclusive and it's not overengineering — they're different tools for different cases.

Are there other alternatives besides llama.cpp?
Yes: LM Studio has a compatible API (but it's a desktop app), GPT4All has Python bindings, vLLM is the serious option for multi-GPU and high throughput (but requires CUDA and weighs more). For code-embedded use in Python without a server, llama-cpp-python is the most mature option today.

Ollama Solves UX. That's Not Nothing, But It's Not Everything.

Here's the conclusion that took me two days of minimalist setup to accept: Ollama is a developer experience tool, not an infrastructure tool. And that's perfectly fine. It solves a real problem — making running a local LLM accessible to anyone with a decent GPU.

But when you start plugging models into real pipelines, Docker workers, systems that don't have a human watching a terminal, the Ollama abstraction might be solving problems you don't have while adding overhead you don't want.

The 40-second query I brought down to 80ms with a composite index taught me that most optimizations aren't about switching to a different tool — they're about understanding what your current tool actually does, and when that abstraction costs more than it gives you.

Ollama is great. Keep using it. But if you're building something in production with local LLMs, it's worth understanding what's underneath. Even if what you find makes you a little uncomfortable.

Are you running local LLMs in production? What's your setup? I'm genuinely curious whether anyone else reached the same conclusion from a different direction.
