A few weeks ago I started using Ollama to run large language models (LLMs), and I have been enjoying it a lot. After getting the hang of it, I thought it was time to try it on one of our real-world use cases (I'll share more about this later).
At Direktiv we use Kubernetes for all our deployments, and when I tried to run Ollama as a pod, I ran into a couple of issues.
The first issue was that Ollama downloads models on demand, which makes sense given that it supports many models. On startup, the required model has to be fetched, with sizes ranging from 1.5GB to 40GB, and that adds considerably to the container's startup time.
To start the download, you either make an API call or use the CLI to fetch the model you need. In a Kubernetes setup, you can handle this with a postStart lifecycle hook.
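For reference, against a running Ollama instance both options look roughly like this (assuming the default port 11434 and using gemma:2b, the model used throughout this post):

# Pull a model with the CLI
ollama pull gemma:2b

# Or trigger the same download via the HTTP API
curl http://localhost:11434/api/pull -d '{"name": "gemma:2b"}'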
Here is a simple example of an Ollama deployment I put together:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:0.1.29
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
        lifecycle:
          postStart:
            exec:
              command: [ "/bin/sh", "-c", "ollama pull gemma:2b" ]
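Once the pod is running, you can check whether the hook has finished and the model is actually available, for example by listing Ollama's local models inside the pod (this assumes the deployment above):

# List the models Ollama has stored locally in the pod
kubectl exec deploy/ollama -- ollama list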
That worked, but the startup problem remains: the lifecycle hook takes ages to run, and it won't work at all on Kubernetes nodes without internet access. At Direktiv we also use Knative a lot, which does not support lifecycle hooks. So my plan was to build a container image, using the Ollama image as the base, with the model already downloaded.
So, a little hiccup: Ollama runs as an HTTP service with an API, which makes it tricky to run the pull command while building the container image so that the models are ready right from the start. No running services during docker build, remember?
There are a couple of GitHub issues describing this problem, and the common workaround is to start an Ollama container, pull the model, and then copy the downloaded models into a new container build. Personally, I found this process poorly suited for an automated build.
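For illustration, that manual workaround looks roughly like this (a sketch; the container name ollama-puller and the target directory are arbitrary):

# Start a throwaway Ollama container and pull the model into it
docker run -d --name ollama-puller ollama/ollama
docker exec ollama-puller ollama pull gemma:2b

# Copy the downloaded model files out and clean up
docker cp ollama-puller:/root/.ollama ./ollama-models
docker rm -f ollama-puller

# ./ollama-models then has to be baked into a new image or mounted as a volume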
So I put on my developer gloves and thought, "How hard can it be?" I was excited to find that all the download functions in the project were exported, but the dependencies didn't play nice, so I ended up copying and adapting the existing code. Voila! Now we have a neat little container for a multi-stage build. Check out the project here:
https://github.com/jensg-st/ollama-pull
With this container, you can fetch the model in the first stage - in this example gemma:2b. The main container can still use the default ollama/ollama image; the model simply needs to be copied from the downloader stage into /root/.ollama. You can even download multiple models in the first stage.
FROM gerke74/ollama-model-loader as downloader
RUN /ollama-pull gemma:2b
FROM ollama/ollama
ENV OLLAMA_HOST "0.0.0.0"
COPY --from=downloader /root/.ollama /root/.ollama
Let's build it and run it:
cat << 'EOF' > Dockerfile
FROM gerke74/ollama-model-loader as downloader
RUN /ollama-pull gemma:2b
FROM ollama/ollama
ENV OLLAMA_HOST "0.0.0.0"
COPY --from=downloader /root/.ollama /root/.ollama
EOF
docker build -t gemma .
docker run -p 11437:11434 gemma
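Before sending a prompt, you can check that the baked-in model is actually there by listing the local models via the tags endpoint:

# List the models available inside the running container
curl http://localhost:11437/api/tags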
The curl command below sends a question to the container. It is important to use the right value for model - in this case gemma:2b.
curl http://localhost:11437/api/generate -d '{
"model": "gemma:2b",
"prompt": "Why is the sky blue?"
}'
The container streams the response back like this:
{"model":"gemma:2b","created_at":"2024-03-26T15:16:56.780177872Z","response":"The","done":false}
{"model":"gemma:2b","created_at":"2024-03-26T15:16:57.003156881Z","response":" sky","done":false}
{"model":"gemma:2b","created_at":"2024-03-26T15:16:57.223483082Z","response":" appears","done":false}
...
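Since the whole point was running this on Kubernetes and Knative, a minimal Knative Service using the pre-built image could look roughly like the following sketch. It assumes you have pushed the image to a registry of your own; registry.example.com/gemma and the service name ollama-gemma are just placeholders.

cat << 'EOF' | kubectl apply -f -
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ollama-gemma
spec:
  template:
    spec:
      containers:
      - image: registry.example.com/gemma
        ports:
        - containerPort: 11434
EOF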
Please feel free to comment if that was helpful or if something is not working. In the next few posts I will add some real-life functionality to this.