Jovan Chan

Posted on • Originally published at aifoss.dev

KoboldCpp vs Ollama vs llama.cpp for Creative Writing 2026

This article was originally published on aifoss.dev

TL;DR: KoboldCpp wins for fiction and roleplay outright — its sampler stack, story mode, and native SillyTavern API make it the only one purpose-built for creative text. Ollama is better for developers who need a clean API for chat-first apps. llama.cpp direct is for power users who want raw token control and are comfortable scripting their own pipelines.

| | KoboldCpp v1.114.1 | Ollama v0.24.0 | llama.cpp (b9145) |
|---|---|---|---|
| Best for | Roleplay, fiction, long-form story | Developer chatbots, API consumers | Custom scripts, research, maximum flexibility |
| Sampler control | Full (DRY, mirostat, CFG, XTC, temperature, top-k, top-p, min-p) | Partial (temperature, top-p, top-k via API params) | Full via server params or completion endpoint |
| Prompt formatting | Raw or template, your choice | Forces model chat template by default | Raw via /completion endpoint |
| Story features | Story mode, World Info, lorebooks, branching | None | None |
| License | AGPLv3 | MIT | MIT |
| Setup effort | Download single binary | curl install + ollama pull | Build from source or download binary |

Honest take: Use KoboldCpp if you're writing fiction or doing roleplay. Use Ollama if you're building a developer project. Use llama.cpp raw only if you want to pipe output through a script and know what you're doing.


Why the Generic "LLM Runner" Reviews Miss Creative Writers

Every comparison of local LLM runners treats them as interchangeable chat backends. For code generation or Q&A, they largely are. For fiction writing and roleplay, the differences are sharp and practical.

Creative writers and game developers need three things that general-purpose LLM runners treat as afterthoughts:

  1. Sampler control. The quality of generated prose is heavily sensitive to sampling parameters, and above all to repetition handling, which is the biggest single failure mode in long-form generation. A model that keeps cycling back to the same phrases or sentence structures destroys narrative immersion. DRY (Don't Repeat Yourself), mirostat, and XTC samplers exist to fix exactly this, and not all runners expose them.

  2. Prompt formatting freedom. Chat-optimized frontends force all input through a structured Human/Assistant message format. That's fine for Q&A. Fiction generation often needs raw continuation: you give the model partial prose and it keeps writing in the same voice, rather than answering like a chatbot. Runners that lock you into a chat template break this workflow.

  3. Story memory. A 32k token context fills up. You need tools for deciding what stays in context (scene summaries, character descriptions, active plot) versus what gets trimmed. Generic runners don't address this at all.
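
To make item 3 concrete, here is a minimal sketch of context budgeting. It assumes a crude word count as a stand-in for real tokenization, and the names `rough_token_count` and `build_context`, the pinned/scenes split, and the budget are all hypothetical; none of this is taken from any of the three runners.

```python
def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_context(pinned: list[str], scenes: list[str], budget: int) -> str:
    """Keep every pinned entry, then add the newest scenes that still fit.

    Pinned entries (character sheets, active plot notes) are always kept,
    even if they alone exceed the budget; only scene text is trimmed.
    """
    used = sum(rough_token_count(p) for p in pinned)
    kept: list[str] = []
    # Walk scenes newest-first so the oldest text is the first to be dropped.
    for scene in reversed(scenes):
        cost = rough_token_count(scene)
        if used + cost > budget:
            break
        kept.append(scene)
        used += cost
    kept.reverse()  # restore chronological order
    return "\n\n".join(pinned + kept)
```

A real setup would count tokens with the model's own tokenizer and would probably summarize old scenes instead of dropping them outright.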


The Three Tools at a Glance

KoboldCpp v1.114.1 is built directly on llama.cpp's inference core but wraps it in a GUI and API layer specifically designed for creative text generation. It's a single executable — download, point at a GGUF file, run. Maintained by LostRuins on GitHub under AGPLv3. Active development, with releases roughly every two to four weeks.

Ollama v0.24.0 is the most popular local LLM runner for developers. It offers a clean CLI, a model library you fetch from with ollama pull, and an OpenAI-compatible REST API. The template system handles dozens of model formats automatically. MIT license. Excellent for building chat applications; not designed for fiction workflows.

llama.cpp (build b9145) is the underlying inference engine that KoboldCpp (and Ollama, for many operations) is built on. When you run it directly via llama-server or llama-cli, you get the full sampler chain without any UI layer. Maximum control, minimum hand-holding. MIT license.


Sampler Control: The Clearest Win for KoboldCpp

This is where creative writers should pay the most attention.

KoboldCpp exposes the full sampling stack from its GUI and API: temperature, top-k, top-p, min-p, tail-free sampling, typical sampling, repetition penalty, presence penalty, frequency penalty, DRY, mirostat (v1 and v2), CFG (classifier-free guidance), and XTC.

For creative writing specifically, three samplers matter most:

DRY (Don't Repeat Yourself) is an anti-repetition sampler designed as a more nuanced alternative to the blunt repetition-penalty approach. Instead of penalizing any repeated token equally, DRY looks for token sequences that already appeared in the context and penalizes the tokens that would extend such a repeat, with the penalty growing as the repeated sequence gets longer. This matters enormously in long-form prose: the model is far less likely to reuse the same descriptive phrase four paragraphs later, even where repetition-penalty alone wouldn't catch it.
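
A simplified sketch of how that penalty scales, based on my reading of the sampler's published description. The function names are hypothetical, the defaults mirror the dry_multiplier, dry_base, and dry_allowed_length values used in this article's curl example, and details such as sequence breakers are left out.

```python
def longest_repeat(context: list[str], candidate: str) -> int:
    """Length of the longest context suffix that previously preceded `candidate`."""
    best = 0
    for i, tok in enumerate(context):
        if tok != candidate:
            continue
        # Compare the tokens just before position i with the tail of the context.
        n = 0
        while n < i and context[i - 1 - n] == context[len(context) - 1 - n]:
            n += 1
        best = max(best, n)
    return best


def dry_penalty(match_length: int, multiplier: float = 0.8,
                base: float = 1.75, allowed_length: int = 2) -> float:
    """Amount subtracted from the logit of a token that would extend a repeat."""
    if match_length < allowed_length:
        return 0.0  # short repeats (common words, names) pass through untouched
    # The penalty grows exponentially with the length of the repeated sequence.
    return multiplier * base ** (match_length - allowed_length)


context = ["the", "old", "keeper", "sighed", "and", "the", "old"]
match = longest_repeat(context, "keeper")
print(match, dry_penalty(match))
```

With these values a two-token repeat costs 0.8 and a five-token repeat costs about 4.29, which is why long verbatim loops get shut down while ordinary word reuse survives.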

Mirostat controls the perplexity of output directly rather than adjusting raw token probabilities. It adapts sampling parameters in real time to target a specific "surprisingness" level. For fiction generation, Mirostat v2 with a tau of around 5.0 tends to produce more coherent long-form prose than pure top-p sampling.
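
The feedback loop can be sketched in a few lines. This is a simplified illustration of the Mirostat v2 idea as I understand it, not the code any runner ships: the function name, the dict-of-probabilities input, and the eta default of 0.1 are assumptions, and every probability is assumed to be greater than zero.

```python
import math
import random
from typing import Optional


def mirostat_v2_step(probs: dict[str, float], mu: float, tau: float = 5.0,
                     eta: float = 0.1,
                     rng: Optional[random.Random] = None) -> tuple[str, float]:
    """Sample one token and return it with the updated surprise ceiling mu."""
    rng = rng or random.Random()
    # Keep only tokens whose surprise (-log2 p) is under the current ceiling.
    allowed = {t: p for t, p in probs.items() if -math.log2(p) <= mu}
    if not allowed:  # always keep at least the single most likely token
        top = max(probs, key=probs.get)
        allowed = {top: probs[top]}
    total = sum(allowed.values())
    tokens = list(allowed)
    weights = [allowed[t] / total for t in tokens]
    choice = rng.choices(tokens, weights=weights, k=1)[0]
    # Feedback: move the ceiling so average surprise drifts toward tau.
    observed = -math.log2(weights[tokens.index(choice)])
    return choice, mu - eta * (observed - tau)
```

A caller would start with mu at roughly twice tau and thread the returned mu into the next step, so a run of predictable tokens loosens the ceiling and a run of surprising ones tightens it.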

XTC (Exclude Top Choices) removes the most likely tokens when several continuations are all viable, forcing the model toward less predictable but more interesting choices. The same author who wrote DRY also wrote XTC, and the two work well together.
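
A rough sketch of the XTC rule, assuming the usual description of the sampler: with some probability per step, every token at or above a threshold is dropped except the least likely of them. The function name and the default values here are illustrative, not the defaults of any runner.

```python
import random
from typing import Optional


def xtc_filter(probs: dict[str, float], threshold: float = 0.1,
               probability: float = 0.5,
               rng: Optional[random.Random] = None) -> dict[str, float]:
    """Return a renormalized distribution with the top choices excluded."""
    rng = rng or random.Random()
    if rng.random() >= probability:
        return dict(probs)  # sampler not triggered on this step
    viable = [t for t, p in probs.items() if p >= threshold]
    if len(viable) < 2:
        return dict(probs)  # only one strong option: leave it alone
    keep = min(viable, key=probs.get)  # least likely of the viable tokens
    kept = {t: p for t, p in probs.items() if t == keep or p < threshold}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}


step = {"the": 0.5, "a": 0.3, "storm": 0.15, "zebra": 0.05}
print(xtc_filter(step, threshold=0.1, probability=1.0))
```

Because a token is only removed when at least one other token clears the threshold, a step with a single obviously correct continuation is left untouched, which keeps grammar intact.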

Ollama's API exposes temperature, top_k, top_p, repeat_penalty, and a handful of others, but DRY, mirostat, XTC, and CFG are not available through the standard Ollama API. You can't set them in a Modelfile either.
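
For comparison, here is a hedged sketch of what the Ollama side looks like, limited to the four parameters named above. It assumes Ollama's /api/generate route on its default port 11434 and a model called mistral that has already been pulled; the helper names and the sampler values are hypothetical.

```python
import json
import urllib.request


def ollama_payload(prompt: str, model: str = "mistral") -> dict:
    """Build a non-streaming generate request with basic sampler options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Illustrative values; tune these per model.
        "options": {"temperature": 0.75, "top_k": 40, "top_p": 0.95, "repeat_penalty": 1.1},
    }


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON reply (needs a running server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a server running, `post_json("http://localhost:11434/api/generate", ollama_payload("The storm broke and"))["response"]` should return the generated text.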

llama.cpp's llama-server exposes the full sampler chain, with the same defaults as KoboldCpp in many cases, via its /completion endpoint. The sampler order is: penalties, DRY, top_n_sigma, top_k, typ_p, top_p, min_p, XTC, temperature. You can override any of these per request in JSON. So llama.cpp direct is a functional alternative for sampler control, just with no UI.

Example llama-server request with DRY enabled:

curl http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "The old lighthouse keeper looked out at the storm and",
    "n_predict": 200,
    "temperature": 0.75,
    "dry_multiplier": 0.8,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "mirostat": 2,
    "mirostat_tau": 5.0,
    "repeat_penalty": 1.1
  }'

KoboldCpp wraps this same kind of control behind a GUI slider set, making it accessible to writers who don't want to craft JSON payloads.


Prompt Formatting: Chat Templates vs Raw Continuation

This is the second critical difference for fiction writing.

Ollama uses a template system that converts chat messages into the input format each model expects. When you use ollama run mistral, it wraps your prompt in Mistral's [INST] tokens. When you use the API, you send structured messages and Ollama handles formatting. For chat, this is exactly right. For raw continuation, it's a problem.

Fiction generation often means: given this partial paragraph, continue writing. You don't want the model to interpret that as a user message and generate a "helpful assistant" response. You want it to keep writing prose in the same voice and style. Ollama's template system makes this harder than it should be. You can write a custom TEMPLATE in a Modelfile to work around it, but it's not the path of least resistance.
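
One escape hatch worth knowing about: Ollama's /api/generate documents a raw flag that sends the prompt to the model without applying the template. The sketch below assumes that flag behaves as documented in your Ollama version; the function name, the mistral model, and the temperature are placeholders.

```python
import json


def raw_continuation_payload(passage: str, model: str = "mistral") -> dict:
    """Build a generate request that asks Ollama to skip chat templating."""
    return {
        "model": model,
        "prompt": passage,  # partial prose, sent as-is
        "raw": True,        # assumption: no template is applied when set
        "stream": False,
        "options": {"temperature": 0.75},
    }


body = json.dumps(raw_continuation_payload("The old lighthouse keeper looked out at the storm and"))
print(body)
```

POSTing that body to /api/generate gives you a continuation instead of a chat reply, but you still have to craft the JSON yourself, so the path-of-least-resistance point stands.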

KoboldCpp and llama.cpp run directly both support true continuation by default. KoboldCpp's "Story Mode" in the built-in UI is literally designed around this: you write a passage, press continue, and the model picks up the thread without any chat wrapper. The /api/v1/generate endpoint for KoboldAI-style requests works the same way.
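
A minimal sketch of a KoboldAI-style request, assuming the generate route is served at /api/v1/generate on KoboldCpp's default port 5001. The helper names and all numeric values are illustrative; check the field names against the API documentation bundled with your KoboldCpp build.

```python
import json


def kobold_story_payload(passage: str) -> dict:
    """Build a raw-continuation request with DRY and XTC switched on."""
    return {
        "prompt": passage,           # sent as-is, no chat wrapper
        "max_context_length": 8192,  # illustrative context size
        "max_length": 200,           # tokens to generate
        "temperature": 0.75,
        "rep_pen": 1.1,
        "dry_multiplier": 0.8,
        "dry_base": 1.75,
        "dry_allowed_length": 2,
        "xtc_threshold": 0.1,
        "xtc_probability": 0.5,
    }


def extract_text(response: dict) -> str:
    """Pull the continuation out of a KoboldAI-style response body."""
    return response["results"][0]["text"]


print(json.dumps(kobold_story_payload("The old lighthouse keeper looked out at the storm and"), indent=2))
```

POST the payload as JSON to http://localhost:5001/api/v1/generate and pass the decoded reply to `extract_text`.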

llama.cpp's /completion endpoint also accepts raw prompts with no template enforcement. If you send a raw string, it generates a raw continuation. This makes it a viable alternative for developers scripting story generation pipelines.
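
A story pipeline built on this can be as simple as feeding the growing text back in as the next prompt. The sketch below assumes llama-server on its default port 8080 and that the reply carries the generated text in a content field; `continue_story` and `llama_server_complete` are hypothetical names.

```python
import json
import urllib.request
from typing import Callable


def llama_server_complete(prompt: str,
                          url: str = "http://localhost:8080/completion") -> str:
    """Send one raw prompt to llama-server and return the generated text."""
    payload = {"prompt": prompt, "n_predict": 200, "temperature": 0.75}
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["content"]


def continue_story(opening: str, complete: Callable[[str], str],
                   rounds: int = 3) -> str:
    """Extend the story by appending each continuation to the prompt."""
    story = opening
    for _ in range(rounds):
        story += complete(story)
    return story
```

Calling `continue_story("The old lighthouse keeper looked out at the storm and", llama_server_complete)` runs three rounds against a live server. Passing the completion function in as an argument also makes it easy to swap in another backend or a stub for testing.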


Context Length and Memory Management

All three runners support long context in the underlying model. Setting context in each:


bash
# KoboldCpp GUI: Co