Akshay Gore
Self-Hosted AI on Linux: A DevOps Home Lab Guide

Audience: Intermediate DevOps/Systems Engineers | Series: Part 1 of 4

Fun part: chat with your own LLM without worrying about API tokens expiring.

Section 1 — Introduction

1.1 The 5 Layers of the AI Ecosystem

| Layer | Role | Dev / Home Lab | Production |
| --- | --- | --- | --- |
| 5 | Applications | Simple chatbot scripts | RAG pipelines, Agents, Chatbots |
| 4 | Frameworks | LangChain, LlamaIndex | LangChain, LlamaIndex, LiteLLM |
| 3 | Model Serving | Ollama | vLLM, TGI, Triton |
| 2 | Models | phi3:mini, gemma:2b | Mistral 7B, Llama 3 70B |
| 1 | Infrastructure | VirtualBox VM, Mac Mini M-series, local hardware | AWS/GCP/Azure, GPU servers |

This post covers Layers 1, 2 and 3; Layers 4 and 5 will follow in later posts in this series.

1.2 What This Post Covers

  • Setting up an Ubuntu Server VM on VirtualBox — the server that runs the LLM.
  • Installing and configuring Ollama as a systemd service — Ollama is the program that downloads, manages, and serves LLM models.
  • Using phi3:mini as the model, which is light enough for a home lab setup. It is the same kind of chat model as Sonnet or Gemini, just at a much smaller scale.
  • Automating the entire setup with Ansible.
  • Interacting with the model via the CLI, curl, and Postman.

Setup flow: Ansible runs on the user's system and configures the Ubuntu VM (`phi`) to run the LLM.


Section 2 — VM Setup

2.1 VM Specs

| Component | Spec | Reason |
| --- | --- | --- |
| RAM | 8 GB | phi3:mini needs ~3.7 GB in memory; leaves headroom for the OS |
| CPU | 4 cores | CPU inference benefits from multiple cores |
| Disk | 30 GB | 2.2 GB model + Ubuntu OS + logs + breathing room |
| OS | Ubuntu Server 22.04 LTS (minimal) | Stable, well supported, no GUI overhead |
| Network | Bridged Adapter | VM gets its own IP, so Ansible and API clients on other machines can reach it |
| Hostname | phi | Named after the model running on it |

Screenshot of virtual machine specs

Note: A Mac Mini with an Apple M-series chip is also a great option; its unified memory architecture (UMA) can handle larger models at higher throughput, e.g. Llama 3 / 3.2 or Mistral 7B. Since I am running a simple VM with no GPU, I am using the small phi3:mini model.

2.2 Hostname Setup

I named the VM `phi` and will reuse this name in Ansible to keep things clean and simple.
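For reference, the hostname setup boils down to one command on the VM, plus an optional hosts entry on the Ansible control machine (the IP below is a placeholder for your VM's bridged address):

```
# On the VM
sudo hostnamectl set-hostname phi

# On the Ansible control machine: /etc/hosts
192.168.1.50   phi
```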

Screenshot of vm hostname


Section 3 — Installing and Configuring Ollama Manually

We will walk through the manual installation first, so it is clear exactly what Ansible automates in the next section.

3.1 Installation

  • Ollama is a free, open-source tool that lets you easily download, set up, and run LLMs (like Llama 3, Mistral, and Gemma) locally. It acts like a "Docker for LLMs", managing the technical complexity so you can run a private, offline AI chat or coding assistant with a single command.
  • Installation is a single command: `curl -fsSL https://ollama.com/install.sh | sh`
  • The install script creates a systemd service automatically once it completes successfully.

Screenshot of ollama service up and running

3.2 Systemd Override Configuration

  • `OLLAMA_HOST=0.0.0.0` — accept connections from any client on the subnet, not just from localhost.
  • `OLLAMA_KEEP_ALIVE` — controls the model unload timeout. If the model is not queried for 5 minutes, Ollama unloads it from RAM automatically to free memory.
  • `StandardOutput` / `StandardError` — redirect logs to a custom path. Ideally put this on a separate partition or disk so logs can never fill the root filesystem.
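Put together, the override (created with `sudo systemctl edit ollama`, which writes `/etc/systemd/system/ollama.service.d/override.conf`) looks roughly like this; the exact log filenames are my assumption:

```
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_KEEP_ALIVE=5m"
StandardOutput=append:/var/log/ollama/ollama.log
StandardError=append:/var/log/ollama/error.log
```

Apply it with `sudo systemctl daemon-reload && sudo systemctl restart ollama`.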

Note: LLM models are loaded from disk into RAM before they can be served; this is called "warming up" the model. In production setups, a heartbeat request is often used to keep the model permanently warm and ready to serve, since a cold start directly hurts the user experience.
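As a sketch, such a heartbeat can be as simple as a cron entry that sends a tiny prompt more often than the keep-alive timeout (the 4-minute interval and "ping" prompt are illustrative, not part of this setup):

```
# crontab -e on the VM: query the model every 4 minutes so a 5m OLLAMA_KEEP_ALIVE never expires
*/4 * * * * curl -s http://localhost:11434/api/generate -d '{"model":"phi3:mini","prompt":"ping"}' > /dev/null
```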

Screenshot of ollama config

3.3 Log Configuration

  • Create the directory /var/log/ollama with correct `ollama:ollama` ownership.
  • Use a custom log location to capture everything: journalctl filters logs by verbosity, but we want the full stdout and stderr output.

3.4 Logrotate

  • Config file at `/etc/logrotate.d/ollama`
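A minimal config for that file might look like this (the daily rotation, 7-file retention, and size cap are assumptions; tune them to your disk budget):

```
/var/log/ollama/*.log {
    daily
    rotate 7
    size 10M
    compress
    delaycompress
    missingok
    notifempty
    create 0640 ollama ollama
}
```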

Screenshot of logrotate config

Screenshot of logrotate service working as expected
A few useful logrotate commands:

  1. `logrotate --debug /etc/logrotate.d/ollama` — dry run
  2. `logrotate --force /etc/logrotate.d/ollama` — force a rotation
  3. `ls -lh /var/log/ollama/` — check whether the logs rotated

Screenshot of logrotate commands

In the screenshot above we can see that the logs were rotated.

3.5 Pull and Test Model

  • `ollama pull phi3:mini` — download the model
  • `ollama list` — verify the download
  • `ollama run phi3:mini` — quick interactive test

Your private LLM is up and running, ready to answer your queries.

Screenshot of ollama basic commands

Section 4 — Automating with Ansible

Now that we understand every manual step, let's automate it all.

4.1 Repository Structure

Screenshot of directory structure of llm ansible repo
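The repository follows a standard role layout. Stripped down, the playbook does something like the sketch below (task names and the exact structure are illustrative, not copied from the repo):

```yaml
# playbook.yml — install Ollama, configure the service, pull the model
- hosts: phi
  become: true
  tasks:
    - name: Install Ollama via the official script
      shell: curl -fsSL https://ollama.com/install.sh | sh
      args:
        creates: /usr/local/bin/ollama   # makes the task idempotent

    - name: Create the custom log directory
      file:
        path: /var/log/ollama
        state: directory
        owner: ollama
        group: ollama
        mode: "0755"

    - name: Restart Ollama to pick up the systemd override
      systemd:
        name: ollama
        state: restarted
        daemon_reload: true

    - name: Pull the phi3:mini model
      command: ollama pull phi3:mini
```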

4.2 Running the Playbook

  1. Dry run: `ansible-playbook -i inventory.ini playbook.yml --check`

Screenshot of ansible dry run

Faced an error here because the Ollama service was not installed yet; the playbook handles this case.

  2. Run the playbook: `ansible-playbook -i inventory.ini playbook.yml`

Screenshot of ansible being executed

4.3 GitHub Repo


Ansible role to deploy the phi3:mini LLM on a Linux VM

Prerequisites

VM Specs:

  1. Minimum 8 GB RAM (phi3:mini is approximately 3 GB; roughly half the RAM is consumed by the model, the other half is reserved for the OS)
  2. 4 cores
  3. 30 GB HDD

Note: the system used to run Ansible should be able to SSH into the VM without a password, using public key authentication (e.g. `ssh-copy-id user@phi`).

Steps

  1. Update the inventory file with the VM's IP and the username Ansible should connect as
  2. Dry run: `ansible-playbook -i inventory.ini playbook.yml --check`
  3. Run the playbook: `ansible-playbook -i inventory.ini playbook.yml`




Section 5 — Interacting with the Model

Two ways to interact — the CLI and curl. Each is progressively closer to how real applications use the model.

5.1 CLI — ollama run

```shell
ollama --version      # check the installed version
ollama list           # models downloaded to disk
ollama ps             # models currently loaded in RAM
ollama run phi3:mini  # interactive chat session
ollama show phi3:mini # model details (parameters, context length, license)
```

Screenshot of ollama commands executed

In the image above we can see that Ollama unloaded the model because it was idle. Running `ollama run phi3:mini` loads it back into RAM, which is the "warming up" step again.

5.2 REST API via curl

This is the important part — how applications actually talk to Ollama. Below are a few of the endpoints it exposes:

  • `/api/generate` — single prompt
  • `/api/chat` — conversation with history and roles
```shell
# 1. Check that the server is responding
curl http://localhost:11434

# 2. Prompt-like experience: ask a question, the model answers
curl http://localhost:11434/api/generate \
  -d '{"model": "phi3:mini", "prompt": "What is Linux?", "stream": false}'

# 3. Chat-like interaction: question and answer with history
curl http://localhost:11434/api/chat \
  -d '{
    "model": "phi3:mini",
    "stream": false,
    "messages": [
      {
        "role": "user",
        "content": "What is Linux?"
      },
      {
        "role": "assistant",
        "content": "Linux is an open source operating system..."
      },
      {
        "role": "user",
        "content": "Who created it?"
      }
    ]
  }' | python3 -c "import sys,json; print(json.load(sys.stdin)['message']['content'])"
```

Screenshot of interaction with ollama

Some example interactions with our own LLM model.

Screenshot of interaction with LLM

There are many more important elements to discuss in future posts, such as the performance metrics Ollama returns with every response:

  1. `eval_count` — number of tokens (roughly, word pieces) generated
  2. `eval_duration` — time taken to generate those tokens
  3. `total_duration` — total time to execute the query
  4. `prompt_eval_count` — tokens consumed by the input, including the chat history (useful for context and cost tracking)
  5. `load_duration` — time to load the model into the server's memory
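These fields arrive with every non-streaming API response. As a quick sketch (the response values below are made up for illustration; Ollama reports all durations in nanoseconds), you can turn them into a tokens-per-second figure:

```python
import json

# Hypothetical metric fields from an /api/generate response (values invented)
sample = json.loads("""
{
  "eval_count": 120,
  "eval_duration": 4000000000,
  "prompt_eval_count": 25,
  "load_duration": 1500000000,
  "total_duration": 6000000000
}
""")

# Durations are in nanoseconds, so divide by 1e9 to get seconds
tokens_per_second = sample["eval_count"] / (sample["eval_duration"] / 1e9)
print(f"Generated {sample['eval_count']} tokens at {tokens_per_second:.1f} tokens/sec")
```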
