Akshay Gore
Self-Hosted AI on Linux: A DevOps Home Lab Guide

Audience: Intermediate DevOps/Systems Engineers | Series: Part 1 of 4

Fun part: chat with your own LLM without worrying about API tokens expiring.

Section 1 — Introduction

1.1 The 5 Layers of the AI Ecosystem

| Layer | Role | Dev / Home Lab | Production |
| --- | --- | --- | --- |
| 5 | Applications | Simple chatbot scripts | RAG pipelines, Agents, Chatbots |
| 4 | Frameworks | LangChain, LlamaIndex | LangChain, LlamaIndex, LiteLLM |
| 3 | Model Serving | Ollama | vLLM, TGI, Triton |
| 2 | Models | phi3:mini, gemma:2b | Mistral 7B, Llama 3 70B |
| 1 | Infrastructure | VirtualBox VM, Mac Mini M-series, local hardware | AWS/GCP/Azure, GPU servers |

This post covers Layers 1, 2 and 3; Layers 4 and 5 will follow in later posts in this series.

1.2 What This Post Covers

  • Setting up an Ubuntu Server VM on VirtualBox — the server that runs the LLM.
  • Installing and configuring Ollama as a systemd service — Ollama is the program that downloads, manages, and serves LLM models.
  • Using phi3:mini as the model, which is light enough for a home lab setup. It is the same kind of chat model as Sonnet or Gemini, just at a much smaller scale.
  • Automating the entire setup with Ansible.
  • Interacting with the model via the CLI, curl, and Postman.

Setup flow: Ansible runs on the user's system and configures the Ubuntu VM (`phi`) to run the LLM.


Section 2 — VM Setup

2.1 VM Specs

| Component | Spec | Reason |
| --- | --- | --- |
| RAM | 8 GB | phi3:mini needs ~3.7 GB in memory; leaves headroom for the OS |
| CPU | 4 cores | CPU inference benefits from multiple cores |
| Disk | 30 GB | 2.2 GB model + Ubuntu OS + logs + breathing room |
| OS | Ubuntu Server 22.04 LTS (minimal) | Stable, well supported, no GUI overhead |
| Network | Bridged Adapter | VM gets its own IP, so Ansible and API clients on other machines can reach it |
| Hostname | phi | Named after the model running on it |

Screenshot of virtual machine specs

Note: A Mac Mini with an Apple M-series chip is also a great option; its unified memory architecture (UMA) can handle larger models at higher throughput, e.g. Llama 3 / 3.2 or Mistral 7B. Since I am running a simple VM with no GPU, I am using the small phi3:mini model.

2.2 Hostname Setup

I named the VM `phi` and will reuse this name in Ansible to keep things clean and simple.
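For reference, the hostname setup boils down to one command on the VM, plus an optional hosts entry on the Ansible control machine (the IP below is a placeholder for your VM's bridged address):

```
# On the VM
sudo hostnamectl set-hostname phi

# On the Ansible control machine: /etc/hosts
192.168.1.50   phi
```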

Screenshot of vm hostname


Section 3 — Installing and Configuring Ollama Manually

We will walk through the manual installation first, so it is clear exactly what Ansible automates in the next section.

3.1 Installation

  • Ollama is a free, open-source tool that lets you easily download, set up, and run LLMs (like Llama 3, Mistral, and Gemma) locally. It acts like a "Docker for LLMs", managing the technical complexity so you can run a private, offline AI chat or coding assistant with a single command.
  • Installation is a single command: `curl -fsSL https://ollama.com/install.sh | sh`
  • The install script creates a systemd service automatically once it completes successfully.

Screenshot of ollama service up and running

3.2 Systemd Override Configuration

  • `OLLAMA_HOST=0.0.0.0` — accept connections from any client on the subnet, not just from localhost.
  • `OLLAMA_KEEP_ALIVE` — controls the model unload timeout. If the model is not queried for 5 minutes, Ollama unloads it from RAM automatically to free memory.
  • `StandardOutput` / `StandardError` — redirect logs to a custom path. Ideally put this on a separate partition or disk so logs can never fill the root filesystem.
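Put together, the override (created with `sudo systemctl edit ollama`, which writes `/etc/systemd/system/ollama.service.d/override.conf`) looks roughly like this; the exact log filenames are my assumption:

```
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_KEEP_ALIVE=5m"
StandardOutput=append:/var/log/ollama/ollama.log
StandardError=append:/var/log/ollama/error.log
```

Apply it with `sudo systemctl daemon-reload && sudo systemctl restart ollama`.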

Note: LLM models are loaded from disk into RAM before they can be served; this is called "warming up" the model. In production setups, a heartbeat request is often used to keep the model permanently warm and ready to serve, since a cold start directly hurts the user experience.
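As a sketch, such a heartbeat can be as simple as a cron entry that sends a tiny prompt more often than the keep-alive timeout (the 4-minute interval and "ping" prompt are illustrative, not part of this setup):

```
# crontab -e on the VM: query the model every 4 minutes so a 5m OLLAMA_KEEP_ALIVE never expires
*/4 * * * * curl -s http://localhost:11434/api/generate -d '{"model":"phi3:mini","prompt":"ping"}' > /dev/null
```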

Screenshot of ollama config

3.3 Log Configuration

  • Create the directory /var/log/ollama with correct `ollama:ollama` ownership.
  • Use a custom log location to capture everything: journalctl filters logs by verbosity, but we want the full stdout and stderr output.

3.4 Logrotate

  • Config file at `/etc/logrotate.d/ollama`
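A minimal config for that file might look like this (the daily rotation, 7-file retention, and size cap are assumptions; tune them to your disk budget):

```
/var/log/ollama/*.log {
    daily
    rotate 7
    size 10M
    compress
    delaycompress
    missingok
    notifempty
    create 0640 ollama ollama
}
```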

Screenshot of logrotate config

Screenshot of logrotate service working as expected
A few useful logrotate commands:

  1. `logrotate --debug /etc/logrotate.d/ollama` — dry run
  2. `logrotate --force /etc/logrotate.d/ollama` — force a rotation
  3. `ls -lh /var/log/ollama/` — check whether the logs rotated

Screenshot of logrotate commands

In the screenshot above we can see that the logs were rotated.

3.5 Pull and Test Model

  • `ollama pull phi3:mini` — download the model
  • `ollama list` — verify the download
  • `ollama run phi3:mini` — quick interactive test

Your private LLM is up and running, ready to answer your queries.

Screenshot of ollama basic commands

Section 4 — Automating with Ansible

Now that we understand every manual step, let's automate it all.

4.1 Repository Structure

Screenshot of directory structure of llm ansible repo
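The repository follows a standard role layout. Stripped down, the playbook does something like the sketch below (task names and the exact structure are illustrative, not copied from the repo):

```yaml
# playbook.yml — install Ollama, configure the service, pull the model
- hosts: phi
  become: true
  tasks:
    - name: Install Ollama via the official script
      shell: curl -fsSL https://ollama.com/install.sh | sh
      args:
        creates: /usr/local/bin/ollama   # makes the task idempotent

    - name: Create the custom log directory
      file:
        path: /var/log/ollama
        state: directory
        owner: ollama
        group: ollama
        mode: "0755"

    - name: Restart Ollama to pick up the systemd override
      systemd:
        name: ollama
        state: restarted
        daemon_reload: true

    - name: Pull the phi3:mini model
      command: ollama pull phi3:mini
```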

4.2 Running the Playbook

  1. Dry run: `ansible-playbook -i inventory.ini playbook.yml --check`

Screenshot of ansible dry run

Faced an error here because the Ollama service was not installed yet; the playbook handles this case.

  2. Run the playbook: `ansible-playbook -i inventory.ini playbook.yml`

Screenshot of ansible being executed

4.3 GitHub Repo


Ansible role to deploy the phi3:mini LLM on a Linux VM

Prerequisites

VM Specs:

  1. Minimum 8 GB RAM (phi3:mini is approximately 3 GB; roughly half the RAM is consumed by the model, the other half is reserved for the OS)
  2. 4 cores
  3. 30 GB HDD

Note: the system used to run Ansible should be able to SSH into the VM without a password, using public key authentication (e.g. `ssh-copy-id user@phi`).

Steps

  1. Update the inventory file with the VM's IP and the username Ansible should connect as
  2. Dry run: `ansible-playbook -i inventory.ini playbook.yml --check`
  3. Run the playbook: `ansible-playbook -i inventory.ini playbook.yml`




Section 5 — Interacting with the Model

Two ways to interact — the CLI and curl. Each is progressively closer to how real applications use the model.

5.1 CLI — ollama run

```shell
ollama --version      # check the installed version
ollama list           # models downloaded to disk
ollama ps             # models currently loaded in RAM
ollama run phi3:mini  # interactive chat session
ollama show phi3:mini # model details (parameters, context length, license)
```

Screenshot of ollama commands executed

In the image above we can see that Ollama unloaded the model because it was idle. Running `ollama run phi3:mini` loads it back into RAM, which is the "warming up" step again.

5.2 REST API via curl

This is the important part — how applications actually talk to Ollama. Below are a few of the endpoints it exposes:

  • `/api/generate` — single prompt
  • `/api/chat` — conversation with history and roles
```shell
# 1. Check that the server is responding
curl http://localhost:11434

# 2. Prompt-like experience: ask a question, the model answers
curl http://localhost:11434/api/generate \
  -d '{"model": "phi3:mini", "prompt": "What is Linux?", "stream": false}'

# 3. Chat-like interaction: question and answer with history
curl http://localhost:11434/api/chat \
  -d '{
    "model": "phi3:mini",
    "stream": false,
    "messages": [
      {
        "role": "user",
        "content": "What is Linux?"
      },
      {
        "role": "assistant",
        "content": "Linux is an open source operating system..."
      },
      {
        "role": "user",
        "content": "Who created it?"
      }
    ]
  }' | python3 -c "import sys,json; print(json.load(sys.stdin)['message']['content'])"
```

Screenshot of interaction with ollama

Some example interactions with our own LLM model.

Screenshot of interaction with LLM

There are many more important elements to discuss in future posts, such as the performance metrics Ollama returns with every response:

  1. `eval_count` — number of tokens (roughly, word pieces) generated
  2. `eval_duration` — time taken to generate those tokens
  3. `total_duration` — total time to execute the query
  4. `prompt_eval_count` — tokens consumed by the input, including the chat history (useful for context and cost tracking)
  5. `load_duration` — time to load the model into the server's memory
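These fields arrive with every non-streaming API response. As a quick sketch (the response values below are made up for illustration; Ollama reports all durations in nanoseconds), you can turn them into a tokens-per-second figure:

```python
import json

# Hypothetical metric fields from an /api/generate response (values invented)
sample = json.loads("""
{
  "eval_count": 120,
  "eval_duration": 4000000000,
  "prompt_eval_count": 25,
  "load_duration": 1500000000,
  "total_duration": 6000000000
}
""")

# Durations are in nanoseconds, so divide by 1e9 to get seconds
tokens_per_second = sample["eval_count"] / (sample["eval_duration"] / 1e9)
print(f"Generated {sample['eval_count']} tokens at {tokens_per_second:.1f} tokens/sec")
```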
