If you’ve experimented with local LLMs, you’ve likely used Ollama, LM Studio, or Jan.ai. These tools are excellent for accessibility, but as a Linux user, you might find yourself wanting more control, less background "magic," and higher performance.
Today, we’re going "under the hood." We are moving from the wrapper (ollama) to the engine (llama.cpp) to extract every bit of power from your local silicon.
Before we touch the terminal, let’s clear up the hierarchy:
llama.cpp is the Engine: A raw C++ implementation of the Llama architecture. It is the core mathematical library that performs inference.
Ollama is the Wrapper: It bundles llama.cpp with a Go-based management layer, a model registry, and a background service.
Why switch to the CLI?
Transparency: No hidden daemons. When the process exits, the model is fully unloaded and your RAM is released.
Performance: 10–20% faster token generation by cutting out software overhead.
Hardware Mastery: You can explicitly target instruction sets like AVX-512, which generic binaries often ignore for the sake of compatibility.
- The Universal Baseline (Hardware)
To run modern 7B or 8B models (like Llama 3.1 or Mistral) comfortably on Linux without a dedicated GPU, aim for:
A PC manufactured in 2020 or newer.
RAM: 12GB+ (8GB works, but 12GB+ prevents swapping).
CPU: Intel 11th Gen+ or AMD Ryzen 5000+.
OS: Any modern Linux distro (Debian/Ubuntu preferred for simplicity).
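If you're not sure where your machine lands, two standard commands (assuming a normal GNU/Linux userland) give you the answer:
# Check total RAM and swap
free -h
# Check the CPU model; look up its generation if you're unsure of the year
lscpu | grep "Model name"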
- The Instruction Set Audit
The "secret power" of local AI is AVX (Advanced Vector Extensions). To see what your CPU supports, run:
lscpu | grep -E --color=always "avx(2|512)"
Look through the output for the avx flags; any matches will be highlighted in color.
AVX2: The standard baseline.
AVX-512 / VNNI: The gold standard. If you see avx512_vnni, your CPU can process AI-specific math significantly faster.
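If that first grep lights up, you can go one step further and list every AVX-512 sub-feature the kernel reports. This just re-parses the same CPU flags, so it's purely informational:
# List each avx512* flag once
lscpu | grep -oE "avx512[a-z_]*" | sort -u
# Count logical CPUs reporting VNNI (0 means unsupported)
grep -c avx512_vnni /proc/cpuinfo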
- Installation: Building for Your Silicon
We don't download a binary; we compile one. This ensures the engine is optimized specifically for your CPU flags.
# Install build essentials
sudo apt update && sudo apt install build-essential cmake git wget
# Clone the repository
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build with 'native' flags (this auto-detects AVX2/AVX-512)
cmake -B build -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
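Once the build finishes, a quick smoke test confirms the binaries exist and run. This assumes the --version flag in current llama.cpp builds; if your checkout predates it, run the binary with -h instead:
# Print the build number and commit to confirm a working binary
./build/bin/llama-cli --version
# The other tools (llama-server, llama-bench, etc.) land in the same directory
ls build/bin/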
- Downloading the Model
llama.cpp uses the GGUF format. Unlike Ollama's pull command, you grab these directly from Hugging Face. For a 12GB RAM machine, the Q4_K_M (4-bit) quantization is the sweet spot. If your machine is on the low end of the requirements, consider starting with a smaller parameter count, e.g. a 4B or even a 1.5B model.
mkdir models
wget -O models/llama-3.1-8b-q4.gguf https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
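The Q4_K_M file should come in at roughly 4.9GB (check the file listing on the Hugging Face page for the exact figure), so a quick size check catches a truncated download before you waste time debugging the run command:
# A truncated download is a common cause of "failed to load model"
ls -lh models/
# If the download was interrupted, resume it rather than restarting
wget -c -O models/llama-3.1-8b-q4.gguf https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf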
- Operation: The Power User Commands
The primary binary is llama-cli. Here is how you run a local session:
./build/bin/llama-cli -m models/llama-3.1-8b-q4.gguf -cnv --mlock -t 4 --color
This is clearly a long command, so consider creating an alias for it in your .bashrc (a sketch follows below). Note that the path after -m must match the filename in your models folder exactly, including capitalization. The first thing to learn is what your machine can actually run: if the model appears to load but takes longer than four or five minutes, remove the --mlock flag from the command above. If it's still too slow, drop down to a smaller model, e.g. a 1.5B, and get that working on that machine. You can always experiment with larger models later.
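Here is one way to set up that alias. The name llm is just an example, and the paths assume you cloned llama.cpp into your home directory, so adjust both to match your setup:
# Append an alias to ~/.bashrc (example name and paths; edit to taste)
echo "alias llm='~/llama.cpp/build/bin/llama-cli -m ~/llama.cpp/models/llama-3.1-8b-q4.gguf -cnv --mlock -t 4 --color'" >> ~/.bashrc
# Reload the shell config, then just type: llm
source ~/.bashrc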
- Understanding the Flags
-cnv: Enables Conversation Mode. It handles the chat templates automatically.
--color: Visually separates your prompt (green/cyan) from the model's response (white).
--mlock: Critical for laptops. It "pins" the model to your physical RAM so the OS can't swap it to disk, eliminating lag.
-t 4: Matches the number of physical cores (not logical threads) for maximum efficiency; the check below shows how to count them.
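To find the right number for -t, count physical cores rather than logical threads. A minimal check on a typical single-socket desktop or laptop:
# Physical cores = "Core(s) per socket" x "Socket(s)"
lscpu | grep -E "Core\(s\) per socket|Socket\(s\)"
# nproc reports logical CPUs, so on a hyper-threaded machine it is usually double the physical count
nproc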
- Verification: Is it working?
When you launch the command, watch the first 10 lines of output for the system_info line.
system_info: n_threads = 4 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 |
If AVX512 = 1, you have successfully optimized your AI assistant to the limit of your hardware. If you miss the line during launch, don't worry; watch for it on the next run, or capture it with the one-shot command below.
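If you'd rather not squint at the startup scroll, you can grab the line non-interactively. This sketch assumes the -p (one-shot prompt) and -n (token limit) flags and that the diagnostics go to stderr, which holds for current llama.cpp builds:
# Generate a few tokens non-interactively and pull out the system_info line
./build/bin/llama-cli -m models/llama-3.1-8b-q4.gguf -p "hi" -n 8 2>&1 | grep "system_info"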
If you've made it to the interface and have a green cursor, you're golden. You are now running a private, hyper-optimized LLM with zero telemetry and 100% transparency. Ask the model something simple for a first test, e.g. "Give me 100 words on photosynthesis."
Unfortunately, it's impossible to write a guide like this and have it work perfectly for everyone; there are too many different machines, system configurations, and other factors that all need to be in place. Even a small slip, like typing a lower-case letter where the model filename has a capital, can stop you.
If you can't get this working, I can confirm that Gemini and probably most of the other online LLMs will be able to walk you through the steps. Or post your issues here and we will help you figure it out.
Good Luck!
Ben Santora - January 2026