Ahsan Nabi Dar

Deploy Ollama with s6-overlay to serve and pull in one shot

Ollama brings the power of Large Language Models (LLMs) directly to your local machine. It removes the complexity of cloud-based solutions by offering a user-friendly framework for running these powerful models.

Ollama is a robust platform designed to simplify the process of running machine learning models locally. It offers an intuitive interface that allows users to efficiently manage and deploy models without the need for extensive technical knowledge. By streamlining the setup and execution processes, Ollama makes it accessible for developers to harness the power of advanced models directly on their local machines, promoting ease of use and faster iterations in development cycles.

However, Ollama does come with a notable limitation when it comes to containerized deployments. To download and manage models, Ollama must be actively running and serving before the models can be accessed. This requirement complicates the deployment process within containers, as it necessitates additional steps to ensure the service is up and operational before any model interactions can occur. Consequently, this adds complexity to Continuous Integration (CI) and Continuous Deployment (CD) pipelines, potentially hindering seamless automation and scaling efforts.

Ollama's Docker Hub page has clear instructions on how to run Ollama, and it takes two steps. In the first step you need to have Ollama running before you can download a model and have it ready for prompting.


docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama


docker exec -it ollama ollama run llama3


On their Discord there is a help query about how to do this in one shot, with a solution that is good but not something I would put in production due to the lack of orchestration and supervision of processes. It's on GitHub as autollama and I recommend checking it out to learn some new tricks.

[Image: Discord help thread]

This is where I leveraged my past experience with s6-overlay to set up serve and pull in a single container, with serve as a longrun and pull as a oneshot that depends on serve being up and running.

The directory structure for it is as below.

[Image: ollama-s6 directory structure]
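
As a rough sketch, an s6-rc service tree for this pattern (under s6-overlay v3) can look like the following; the service names ollama-serve and ollama-pull are illustrative, not necessarily the ones used in the repo.

/etc/s6-overlay/s6-rc.d/
├── ollama-serve/
│   ├── type                    # contains: longrun
│   └── run                     # starts "ollama serve" under supervision
├── ollama-pull/
│   ├── type                    # contains: oneshot
│   ├── up                      # command line that runs the pull script
│   └── dependencies.d/
│       └── ollama-serve        # empty file: pull waits for serve
└── user/
    └── contents.d/
        ├── ollama-serve        # empty file: adds serve to the user bundle
        └── ollama-pull         # empty file: adds pull to the user bundle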

It runs flawlessly, with pull well supervised and orchestrated so that it runs to completion; even when the download is slowed down by the internet connection it keeps the process going without a glitch.
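
For completeness, the run and up entries behind this can be as simple as the sketch below; the /etc/s6-overlay/scripts/ollama-pull.sh path and the llama3 model are assumptions for illustration.

# /etc/s6-overlay/s6-rc.d/ollama-serve/run (longrun)
#!/command/execlineb -P
ollama serve

# /etc/s6-overlay/s6-rc.d/ollama-pull/up (oneshot, runs after serve is up)
/etc/s6-overlay/scripts/ollama-pull.sh

# /etc/s6-overlay/scripts/ollama-pull.sh (assumed helper script)
#!/bin/sh
# Wait until the Ollama API answers, then pull the model once.
until ollama list >/dev/null 2>&1; do sleep 1; done
exec ollama pull llama3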

[Image: ollama pull downloading under s6]

Currently there is a known issue in s6-overlay around the service wait time, which initially caused the oneshot to time out. I had to set S6_CMD_WAIT_FOR_SERVICES_MAXTIME=0 to disable it so the model download would not fail.
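
For example, it can be set in the image's Dockerfile (or passed with -e at docker run time):

# Disable the service-wait timeout; 0 means wait indefinitely, so a slow
# model download does not make the oneshot fail.
ENV S6_CMD_WAIT_FOR_SERVICES_MAXTIME=0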

It is alive. At this point I was just super happy with how smoothly it came up.
[Image: ollama running]

On subsequent runs, pull only fetches the diff, if any, without needing to download the whole model again.
[Image: ollama pull fetching only the diff]

And Ollama has an API that you can prompt, and it's a charm to play around with.
[Image: prompting Ollama over the API]
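
For example, the generate endpoint can be prompted with a simple curl; llama3 here is just the model pulled earlier.

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'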

With serve and pull in a single container that ships alongside your application, you simplify not only your deployments but also your CI, letting you test against it without overcomplicating things with hacked-together scripts.
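
As a sketch, wiring it into CI or local development can be as simple as a compose file like the one below; the image name is hypothetical and would be whatever you build from the repo.

# docker-compose.yml (image name is hypothetical)
services:
  ollama:
    image: ollama-s6:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
volumes:
  ollama: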

I have put the repo on GitHub as ollama-s6 for anyone looking to productionize their Ollama deployments.
