A first basic test: getting my hands on LLaMA.cpp
Introduction
What is llama.cpp?
llama.cpp is a highly optimized C/C++ library designed to run large language models (LLMs), like Meta's Llama family, very efficiently on standard consumer hardware, including common desktop CPUs (which is my case 😥). Its key feature is support for quantized models in the GGUF format, which dramatically shrinks model size and lowers memory requirements. In effect, it is a powerful local engine that makes advanced AI accessible without specialized, expensive dedicated GPUs or reliance on cloud services.
What is the GGUF format?
The GGUF format (GGML Unified Format) is a file format specifically engineered for storing and running large language models (LLMs) efficiently on local consumer hardware. Developed by the creators of the llama.cpp project, its key feature is that it allows models to be quantized: their weight data is compressed (often down to 4-bit or 8-bit precision) to significantly reduce file size and memory footprint without a severe loss in output quality. Critically, a GGUF file is self-contained; it bundles the model weights, the vocabulary, and metadata (such as the tokenizer and chat template) into a single file, making it easy to download and deploy an LLM with minimal configuration across operating systems and hardware configurations.
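To make the size argument concrete, here is a toy sketch (plain NumPy, and definitely not GGUF's actual block-wise quantization scheme) of why storing weights as 8-bit integers plus a scale factor takes roughly a quarter of the memory of 32-bit floats:
import numpy as np
# Toy illustration only: llama.cpp uses smarter block-wise schemes (Q2_K, Q4_K, Q8_0, ...),
# but the basic idea of "small integers + a scale factor" is the same.
weights = np.random.randn(1_000_000).astype(np.float32)  # fake fp32 weights
scale = np.abs(weights).max() / 127.0                     # one scale for the whole tensor
q8 = np.round(weights / scale).astype(np.int8)            # 8-bit integer weights
restored = q8.astype(np.float32) * scale                  # what the runtime works with
print(f"fp32 size: {weights.nbytes / 1e6:.1f} MB")
print(f"int8 size: {q8.nbytes / 1e6:.1f} MB")
print(f"mean absolute error: {np.abs(weights - restored).mean():.5f}")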
The reason I did this test
My journey into local Large Language Models (LLMs) has so far relied on Ollama running on my own hardware. While Ollama offers a sleek, user-friendly abstraction layer for quickly downloading and running models, I’ve now chosen to dive deeper into the powerful, low-level mechanics of LLaMA.cpp (one should try everything…🤭).
A Simple Application Implementation
For my very first test, as always, I wanted to use a Granite model 🪨, so I wrote a simple application (mostly inspired by the video listed in the links section).
- First step: download the GGUF file for the model you want to use. I picked a Granite model on Hugging Face -> https://huggingface.co/ibm-granite/granite-3.3-8b-instruct-GGUF/blob/main/granite-3.3-8b-instruct-Q2_K.gguf (a scripted download sketch follows just below).
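If you prefer scripting the download rather than clicking through the browser, here is a small sketch using the huggingface_hub client (the repo id and filename come from the link above; it assumes you have run pip install huggingface_hub):
from huggingface_hub import hf_hub_download
# Fetch the quantized Granite GGUF into ./models
model_path = hf_hub_download(
    repo_id="ibm-granite/granite-3.3-8b-instruct-GGUF",
    filename="granite-3.3-8b-instruct-Q2_K.gguf",
    local_dir="./models",
)
print(f"Model downloaded to: {model_path}")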
- Prepare your environment 🏭
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install llama-cpp-python
# if you're lucky enough to have GPUs on your machine (which I have not tested)
# with NVIDIA...
# pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
# and on Apple Silicon..
# CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
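Before going further, a quick sanity check that the binding imports cleanly (I'm assuming llama-cpp-python exposes a __version__ attribute, which recent releases do):
# sanity_check.py: verify llama-cpp-python was built and installed correctly
import llama_cpp
print("llama-cpp-python version:", llama_cpp.__version__)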
- The application code ⤵️
import os
from datetime import datetime  # used to timestamp the generated Markdown report
from llama_cpp import Llama
# --- Configuration ---
# 1. Define the path to your downloaded GGUF model file
# NOTE: Ensure this path is correct relative to where you run this script.
MODEL_PATH = "./models/granite-3.3-8b-instruct-Q2_K.gguf"
# 2. Define output settings
OUTPUT_DIR = "./output"
# Updated filename to reflect the request for a full application
OUTPUT_FILENAME = "granite_fibonacci_app_output.md"
OUTPUT_FILE_PATH = os.path.join(OUTPUT_DIR, OUTPUT_FILENAME)
# 3. Define the prompt
# Prompt is updated to request a complete, runnable application with validation
PROMPT = "Write a complete, runnable Python command-line application script that prompts the user for a number N, calculates and prints the first N numbers of the Fibonacci sequence using a non-recursive (iterative) approach for efficiency, and includes basic input validation."
# --- Setup and Execution ---
# Check if the model file exists before proceeding
if not os.path.exists(MODEL_PATH):
    print(f"Error: Model file not found at path: {MODEL_PATH}")
    print("Please download the GGUF file and place it in the correct location.")
    exit(1)
# Create the output directory if it doesn't exist
try:
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    print(f"Output directory '{OUTPUT_DIR}' ensured.")
except Exception as e:
    print(f"Error creating output directory: {e}")
    exit(1)
# 4. Initialize the Llama model
print(f"Loading model from {MODEL_PATH}...")
try:
    llm = Llama(
        model_path=MODEL_PATH,
        n_ctx=4096,        # Context window size
        n_threads=8,       # Number of CPU threads
        n_gpu_layers=-1,   # Offload all layers to GPU (if installed with GPU support)
        verbose=True,      # See llama.cpp loading logs
    )
except Exception as e:
    print(f"Error initializing Llama model: {e}")
    print("Check if llama-cpp-python is installed correctly, especially with GPU support if requested.")
    exit(1)
# 5. Generate the completion
print(f"\n--- Generating completion with prompt for full application: '{PROMPT}' ---\n")
try:
    output = llm(
        PROMPT,
        max_tokens=512,    # Increased max_tokens to allow for a longer, complete script
        temperature=0.2,
        # Stop tokens: common delimiters used by LLMs to signify the end of a response or a new turn.
        stop=["\n#", "User:", "<|im_end|>"],
        echo=False,
    )
except Exception as e:
    print(f"Error during model generation: {e}")
    exit(1)
# 6. Extract and format the result
response_text = output["choices"][0]["text"].strip()
print("Model Response (Preview):")
print(response_text)
# Prepare the content for the Markdown file
markdown_content = f"""# Granite Model Response: Full Fibonacci Application Script

**Model Used:** `{MODEL_PATH}`
**Date/Time Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

## Prompt

```markdown
{PROMPT}
```

## Generated Code (Full Application Script)

```python
{response_text}
```

---

*Generation Parameters: n_ctx=4096, temperature=0.2, max_tokens=512*
"""
# 7. Write the output to the Markdown file
with open(OUTPUT_FILE_PATH, 'w', encoding='utf-8') as f:
    f.write(markdown_content)
print(f"\n✅ Successfully saved output to: {OUTPUT_FILE_PATH}")
- Once the application runs, it generates a lot of console output, but the essential part ends up in the Markdown file (since the content is produced by the model, I didn’t make any changes to the file) 📑
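Since the GGUF file also carries the model's chat template (see the GGUF section above), the same llm object can alternatively be driven through the chat API rather than a raw text completion. A minimal sketch, reusing llm and PROMPT from the script above (the system message is just my own addition):
# Let llama-cpp-python apply the chat template embedded in the GGUF
chat_output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": PROMPT},
    ],
    max_tokens=512,
    temperature=0.2,
)
print(chat_output["choices"][0]["message"]["content"])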
Conclusion: should you choose Ollama or LLaMA.cpp?
Choosing between Ollama and the core llama.cpp framework depends entirely on your priority: convenience versus control. Ollama is highly praised for its ease of use, offering a simple command-line interface and a standardized API that handle model downloading, serving, and environment setup automatically. That is a major pro for beginners and quick prototyping; the con is that it hides the underlying performance controls.
Conversely, llama.cpp provides maximum flexibility and optimization, allowing expert users to precisely configure hardware utilization (CUDA, Metal, etc.) and quantization parameters to squeeze the best performance out of a specific machine. Its con is the steeper learning curve: it often requires manual compilation and configuration steps, such as setting CMake arguments, making it less accessible for casual use.
So I guess I’ll mostly stick with Ollama, but after this test I’m also going to use LLaMA.cpp from time to time :)
Thanks for reading 🤗
Links
- LLaMA.cpp wikipedia page: https://en.wikipedia.org/wiki/Llama.cpp
- llama.cpp GitHub: https://github.com/ggml-org/llama.cpp
- The video by Sunny Solanki (CoderzColumn) that helped me get started: "Llama-CPP-Python: Step-by-step Guide to Run LLMs on Local Machine" https://www.youtube.com/watch?v=0SK6H9Vmw6M
- Granite 3.3 GGUF on Hugging Face: https://huggingface.co/ibm-granite/granite-3.3-8b-instruct-GGUF/blob/main/granite-3.3-8b-instruct-Q2_K.gguf