Jangwook Kim

Posted on • Originally published at jangwook.net

Why a local LLM's first reply sometimes takes 10 seconds: I measured the cold start (load_duration)

I have been running a local agent on my MacBook for a few days. When I step away to do something else and come back to the same agent, the first reply is noticeably sluggish. The second and third are fine; only that first one drags. While writing yesterday's post decomposing prefill and generation cost, I wrote that I warmed the model before measuring "so model load time (load_duration) would not contaminate the numbers." That line nagged at me. The thing I deliberately threw away was exactly the delay I feel most often in daily use.

So today I measured the cost I excluded yesterday. The time it takes a model to land in memory, the thing we casually call cold start.

load_duration: the line item you usually don't see

Ollama's /api/generate returns a bundle of timestamps on every response. Yesterday I looked at prompt_eval_duration (prefill) and eval_duration (generation). There is one more at the front: load_duration. As the name says, it is the time spent loading the model.

There is a reason you rarely notice it. Call the same model back to back and from the second call on the model is already resident, so load_duration reads near zero. Leave it idle for a while and Ollama evicts the model from memory (five minutes by default), and the next call resurrects the load cost. That eviction is exactly what I was feeling when "stepping away and coming back" felt slow.

I kept the method simple. To isolate load time, the prompt is a single line, "Reply with the single word: ok", and num_predict is capped at 8 so generation time collapses toward zero. To force a cold state, I call ollama stop <model> right before the request. The load_duration of the first call after that is the cold start.

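```python
import json
import urllib.request

# Default local Ollama endpoint, the same one the curl example uses.
OLLAMA = "http://localhost:11434/api/generate"

def gen(model, keep_alive="5m"):
    """Send a one-line prompt and return load_duration in milliseconds."""
    body = json.dumps({
        "model": model, "prompt": "Reply with the single word: ok",
        "stream": False, "keep_alive": keep_alive,
        # Cap output at 8 tokens so generation time stays negligible.
        "options": {"num_predict": 8, "temperature": 0}
    }).encode()
    req = urllib.request.Request(OLLAMA, data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=600) as r:
        d = json.loads(r.read())
    return d["load_duration"] / 1e6  # nanoseconds -> milliseconds
```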

If you want to check it yourself, one curl line does it. Stop the model, call it once, and pull out load_duration.

```shell
ollama stop gemma4:12b-it-qat
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4:12b-it-qat", "prompt": "ok", "stream": false
}' | python3 -c 'import sys,json; print(json.load(sys.stdin)["load_duration"]/1e9, "s")'
```

The value comes back in nanoseconds, so divide by 1e9 for seconds. Run that line across a few models and you immediately feel how the table below shifts on your own hardware.

There is one reason to trust this measurement. Ollama returns load_duration separately from prompt_eval_duration and eval_duration, so load time does not bleed into the prefill or generation numbers. The response's total_duration came out close to the sum of those three, which let me isolate the load cleanly. Yesterday I looked at the middle two; today I focus on just the first.
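The sanity check described above, that total_duration is close to the sum of the three phases, can be sketched as a small helper. The field names are the ones this post cites from the API response; the sample values are hypothetical and only illustrate the arithmetic.

```python
# All four fields are reported in nanoseconds.
NS_PER_S = 1e9
PHASES = ("load_duration", "prompt_eval_duration", "eval_duration")

def breakdown(resp):
    """Convert the timing fields to seconds and report what the three phases leave unexplained."""
    out = {name: resp.get(name, 0) / NS_PER_S for name in PHASES}
    out["total_duration"] = resp["total_duration"] / NS_PER_S
    out["unaccounted"] = out["total_duration"] - sum(out[name] for name in PHASES)
    return out

# Hypothetical response values, for illustration only (not measurements from this post).
sample = {
    "total_duration": 9_500_000_000,
    "load_duration": 9_000_000_000,
    "prompt_eval_duration": 300_000_000,
    "eval_duration": 150_000_000,
}
result = breakdown(sample)
print(result)
```

If "unaccounted" stays small relative to load_duration, the load number can be read on its own without worrying that prefill or generation leaked into it.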

Cold start by model size

I lined up four Gemma 4 models I had pulled, ordered by size. For each I ran ollama stop, then called it cold three times, plus once warm with the model resident. In seconds:

| Model | On-disk size | Cold #1 | Cold #3 | Warm |
|---|---|---|---|---|
| melavisions/gemma4 | 2.0 GB | 3.33s | 1.55s | 0.20s |
| yinw1590/gemma4-e2b | 3.1 GB | 3.57s | 1.79s | 0.38s |
| gemma4:12b-it-qat | 7.2 GB | 9.00s | 2.82s | 0.37s |
| gemma4:e4b | 9.6 GB | 9.71s | 3.86s | 0.37s |

Cold-start load_duration by model size
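The protocol behind these numbers (stop, call cold three times, then call once warm) can be sketched roughly as below. This is a reconstruction under assumptions, not the exact script used for the table: the helper names stop_model, load_seconds and measure are made up for illustration, and the URL is the default local endpoint.

```python
import json
import subprocess
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

def stop_model(model):
    """Evict the model from Ollama; the OS page cache is not touched."""
    subprocess.run(["ollama", "stop", model], check=False)

def load_seconds(model):
    """Send a tiny prompt and return load_duration in seconds."""
    body = json.dumps({
        "model": model,
        "prompt": "Reply with the single word: ok",
        "stream": False,
        "options": {"num_predict": 8, "temperature": 0},
    }).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=600) as r:
        return json.loads(r.read())["load_duration"] / 1e9

def measure(model, cold_runs=3, stop=stop_model, load=load_seconds):
    """Stop before every cold run; the final warm run reuses the resident model."""
    cold = []
    for _ in range(cold_runs):
        stop(model)
        cold.append(load(model))
    warm = load(model)
    return {"cold": cold, "warm": warm}
```

Calling measure("gemma4:12b-it-qat") against a running Ollama would return something shaped like {"cold": [...], "warm": ...}. The stop and load parameters are injectable so the loop can be exercised without a server.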

The broad pattern is what you would expect: bigger model, longer load. The 9.6GB model's first cold start was 9.7 seconds; the same call warm was 0.37 seconds. A 26x gap. In practice, that means if you leave a 7.2GB local chatbot idle past five minutes and speak to it again, you burn several seconds before a single token appears.

What jumps out is the warm column. Whether the model is 2GB or 9.6GB, warm load_duration sat at 0.2 to 0.4 seconds, basically flat. It does not scale with size. My reading is that this is not actually re-reading weights but some fixed per-request overhead, perhaps Ollama confirming "this model is still up" and refreshing keep_alive. A cost that is not a real load would naturally ignore size. I won't claim to know exactly what work it represents. But for practical purposes, 0.4 seconds warm is "effectively no load cost," and that is the conclusion the measurement supports.

Why "cold" #1 and #3 differ by 2x

Look at the table again and something is off. I stopped the model and re-measured every single time, yet Cold #1 is two to three times slower than Cold #3. The 7.2GB model went from 9.00 to 2.82 seconds, the 9.6GB one from 9.71 to 3.86. Both are labeled "cold," but the numbers disagree.

I got stuck here for a while. I first suspected a measurement bug. The answer was the operating system's page cache. ollama stop only evicts the model from the Ollama process's memory; the OS keeps the model file it already read sitting in RAM as page cache. So Cold #2 and #3 re-read the file from RAM, not disk. Drop the disk I/O entirely and it speeds up.

This matters because the thing we casually call "cold start" is really two different things.

  • Truly cold: right after a reboot, or when memory pressure has flushed the cache. Weights are read from disk for the first time. This is Cold #1.
  • Cached cold: the model is evicted from Ollama but the file is still in the page cache. This is Cold #3.

If you do not separate these when benchmarking, the second measurement onward quietly picks up the cached value, and you reach the rosy conclusion "cold start is faster than I thought." A real production server reboots, and swapping between several models can push each other's files out of the page cache. So when setting an SLA or a cold-start budget, base it on Cold #1, the post-reboot worst case, not Cold #3. Had I not known this and recorded only a later run, I would have written the 7.2GB cold start as "2.8 seconds" when the real worst case was 9.
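One way to reproduce a truly cold run without rebooting is to ask the OS to drop its file cache before the measurement. I did not use this for the numbers in this post, so treat it as a sketch under assumptions: the helper name drop_cache_command is made up, both commands need root, and both flush the cache for the whole machine, not just the model file.

```python
import platform

def drop_cache_command(system=None):
    """Return the command that flushes the OS file cache, or None if unknown."""
    system = system or platform.system()
    if system == "Darwin":
        # macOS ships a `purge` utility that forces the disk cache to be purged.
        return ["sudo", "purge"]
    if system == "Linux":
        # Writing 3 to drop_caches frees the page cache plus dentries and inodes.
        return ["sudo", "sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"]
    return None

print(drop_cache_command())
```

Running the returned command (for example with subprocess.run) right after ollama stop should push the next call toward the Cold #1 case; whether it matches a real post-reboot load exactly is something to verify on your own machine.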

The interesting part is that the gap is much smaller on small models. The 2.0GB model's Cold #1 (3.33s) and Cold #3 (1.55s) differ by about 1.8 seconds, while the 9.6GB model's 9.71s and 3.86s differ by almost 6. More bytes to read from disk means the page cache saves you more time. The bigger the model, the steeper the penalty the "first user after reboot" absorbs. If you plan to serve 13B-class or larger locally, treat this cache dependency as a real operational variable.
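The arithmetic behind that comparison, using the values from the cold-start table, looks like this:

```python
# (model, on-disk GB, Cold #1 seconds, Cold #3 seconds), copied from the cold-start table.
rows = [
    ("melavisions/gemma4", 2.0, 3.33, 1.55),
    ("yinw1590/gemma4-e2b", 3.1, 3.57, 1.79),
    ("gemma4:12b-it-qat", 7.2, 9.00, 2.82),
    ("gemma4:e4b", 9.6, 9.71, 3.86),
]

gaps = {}
for name, size_gb, cold1, cold3 in rows:
    saved = cold1 - cold3   # seconds the page cache removes
    ratio = cold1 / cold3   # how many times slower the truly cold run is
    gaps[name] = (saved, ratio)
    print(f"{name:22s} {size_gb:4.1f} GB  saved {saved:4.2f}s  ratio {ratio:3.1f}x")
```

On these numbers the 7.2GB model shows the largest absolute saving (6.18 seconds), slightly ahead of the 9.6GB one (5.85 seconds), so the trend with size is clear but not strictly monotonic in a sample this small.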

keep_alive splits the bill

The most direct lever against cold start is keep_alive: how long to hold the model in memory. I put it at two extremes and hit the same 7.2GB model three times each.

| Request | keep_alive="0" (unload each time) | keep_alive="10m" (stay warm) |
|---|---|---|
| Request #1 | 7.10s | 2.56s |
| Request #2 | 2.55s | 0.38s |
| Request #3 | 2.55s | 0.38s |

load_duration by keep_alive setting

The contrast is sharp. keep_alive="0" unloads the model immediately after serving a request, so every request is cold. Each one eats 2.5 seconds or more of load up front. Check ollama ps between requests and the model is not in memory.

keep_alive="10m" pays the cold value (2.56s) only on the first request, then drops to 0.38 seconds. It shoves the cold start into request one and serves the rest warm. Request #1 of the keep_alive=0 run spiked to 7.1 seconds because the page cache was also empty at that point, a truly cold start. The effect from the previous section shows up here too.

The same lever exists at two levels: the OLLAMA_KEEP_ALIVE environment variable sets the server-wide default, and the API's keep_alive field sets it per request. Set either to -1 to keep the model resident indefinitely.
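A minimal sketch of setting keep_alive per request and checking residency over HTTP instead of with ollama ps. The helper names are made up for illustration, and the response shape of /api/ps (a "models" list whose entries carry a "name") is my understanding of the API rather than something measured in this post.

```python
import json
import urllib.request

BASE = "http://localhost:11434"  # assumed default local Ollama address

def generate_body(model, prompt, keep_alive):
    """keep_alive takes a duration string such as "10m", "0" to unload at once, or -1 to pin."""
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": keep_alive}

def resident_models(ps_response):
    """Extract model names from a parsed /api/ps response."""
    return [m.get("name") for m in ps_response.get("models", [])]

def fetch_ps():
    """GET /api/ps, the HTTP counterpart of `ollama ps`."""
    with urllib.request.urlopen(f"{BASE}/api/ps", timeout=10) as r:
        return json.loads(r.read())

# Hypothetical /api/ps payload, for illustration only.
sample_ps = {"models": [{"name": "gemma4:12b-it-qat"}]}
print(resident_models(sample_ps))
```

After a keep_alive="0" request, resident_models(fetch_ps()) should come back empty; after a keep_alive="10m" request the model should be listed.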

So how should I run a local agent?

Measuring turned a few vague operational hunches into something concrete.

First, for chat or agent use, give keep_alive plenty of room. If the model is evicted every time a user speaks, every turn is a cold start. Adding 2.5 seconds per turn on a 7.2GB model wrecks the conversation. As long as memory allows, set a long keep_alive or pin it with -1. This is a setting you can drop straight onto the deployment from my Ollama plus FastAPI production serving guide.

Second, warm the model once at startup. Have your boot script fire a dummy prompt to pay the cold start in advance, so the first real user never eats the 9 seconds. You cannot avoid paying the cold cost on the first request, but that first request does not have to be a real user.
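A boot-time warm-up along those lines might look like the sketch below. The function names and the one-token cap are my own choices for illustration, not part of any existing deployment.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

def warmup_request(model, keep_alive="10m"):
    """Build a dummy one-token request whose only job is to load the model."""
    body = json.dumps({
        "model": model,
        "prompt": "ok",
        "stream": False,
        "keep_alive": keep_alive,
        "options": {"num_predict": 1},
    }).encode()
    return urllib.request.Request(OLLAMA_URL, data=body,
                                  headers={"Content-Type": "application/json"})

def warm(model, keep_alive="10m"):
    """Pay the cold start now and return the load time in seconds."""
    with urllib.request.urlopen(warmup_request(model, keep_alive), timeout=600) as r:
        return json.loads(r.read()).get("load_duration", 0) / 1e9
```

Calling warm("gemma4:12b-it-qat", keep_alive=-1) from a startup script would pay the load once and pin the model; logging the returned value also tells you whether that boot was truly cold or cached cold.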

Third, routing across several models is pricier than it looks. Calling a different model per request triggers a reload each time, and if memory is tight they flush each other's page cache down to a true cold (#1 level). If you build a router that rotates four models, compute load cost times switch count up front.
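The "load cost times switch count" estimate can be sketched as follows. It assumes only one model fits in memory at a time, so every change of model is a reload, and it uses the Cold #3 (cached cold) values from the table as the per-load cost; with a flushed page cache the Cold #1 values would apply instead.

```python
def reload_cost(requests, load_seconds):
    """Sum the load cost of every request whose model differs from the previous one."""
    total, previous = 0.0, None
    for model in requests:
        if model != previous:
            total += load_seconds[model]
            previous = model
    return total

# Cached-cold load times in seconds (Cold #3 column of the table).
cached_cold = {"gemma4:12b-it-qat": 2.82, "gemma4:e4b": 3.86}

alternating = ["gemma4:12b-it-qat", "gemma4:e4b", "gemma4:12b-it-qat", "gemma4:e4b"]
batched = ["gemma4:12b-it-qat", "gemma4:12b-it-qat", "gemma4:e4b", "gemma4:e4b"]

print(f"alternating: {reload_cost(alternating, cached_cold):.2f}s")
print(f"batched:     {reload_cost(batched, cached_cold):.2f}s")
```

The same four requests cost twice as much load time when they alternate as when they are grouped by model, which is the argument for batching or sticky routing where the workload allows it.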

Fourth, benchmark inference speed only after warming. That is precisely why I warmed before measuring yesterday. Measure once while cold and load_duration stacks 9 seconds on top of prefill and generation, so you cannot tell whether the model is slow or the load is. The same principle held in my output reproducibility experiment. Fix every variable except the one you are measuring.

Fifth, memory and responsiveness are a trade. A long keep_alive erases cold starts past the first request, but that model occupies RAM the whole time. Pin a 9.6GB model indefinitely and you shrink what other work can use; load another model and the page cache gets flushed, reviving cold. So I decided which models stay up first, gave a long keep_alive to the one or two I use most, and kept the rest short. Holding every model warm is a luxury only enough memory affords. Setting keep_alive=-1 everywhere to drive cold starts to zero just brings the cost back as a bigger cold start when the next model loads.
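The "decide which models stay up first" step can be sketched as a greedy pick under a memory budget. Everything here beyond the model sizes is an assumption: the call counts and the 12GB budget are hypothetical, and on-disk size is used as a rough stand-in for resident memory, which in practice also includes context and runtime overhead.

```python
def pick_pinned(models, budget_gb):
    """Greedy: pin the most-used models first, as long as they fit in the budget."""
    pinned, used = [], 0.0
    for name, size_gb, calls_per_day in sorted(models, key=lambda m: m[2], reverse=True):
        if used + size_gb <= budget_gb:
            pinned.append(name)
            used += size_gb
    return pinned

# (model, on-disk GB from the table, hypothetical calls per day)
models = [
    ("gemma4:e4b", 9.6, 50),
    ("gemma4:12b-it-qat", 7.2, 200),
    ("melavisions/gemma4", 2.0, 20),
    ("yinw1590/gemma4-e2b", 3.1, 5),
]

pinned = pick_pinned(models, budget_gb=12.0)
# Long keep_alive for pinned models, short for the rest.
keep_alive = {name: (-1 if name in pinned else "1m") for name, _, _ in models}
print(pinned)
print(keep_alive)
```

A greedy pick is not optimal in general, but for a handful of models it makes the trade explicit: the budget decides who stays warm, and everything else accepts a cold start.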

Limits and what I still don't know

Let me draw the boundary honestly. These numbers come from one MacBook (Apple Silicon, unified memory). A server with a CUDA GPU has an extra step in the load path, copying from disk through system RAM into VRAM, so the absolute values will differ. Don't transplant my numbers onto other hardware. That said, the structural conclusions should survive a hardware change: cold scales with size while warm does not, page cache splits cold into two kinds, and keep_alive decides every cost past the first request.

Also, I did not dig deep enough into Ollama internals to claim exactly what load_duration sums up. It may include initialization like graph construction, not just the file read. What I can observe is the number the API returns and how it responds to model size, page cache, and keep_alive. That range is the scope of today's measurement. The 0.2 to 0.4 seconds that shows up even when warm is something I guessed at, not confirmed.

Finally, page cache behavior depends on how much free RAM you have. On a memory-tight server, even Cold #2 and #3 could get their cache flushed quickly and slow back down toward Cold #1. My measurement leans toward the optimistic case with ample RAM. Next I want to apply artificial memory pressure and see how long the cache holds. Cold start is not a measure-once topic; it is the kind of cost you have to re-measure per environment.
