Your own AI chatbot in the terminal

Lately I was looking for an opportunity to expand my knowledge of algorithms, particularly to sharpen my design skills for optimisation problems.
So I decided to put my computer's power to work to give me a boost here.
I did a bit of research, and it turns out there are plenty of options for an Apple Silicon M2 machine with 16GB of RAM.
My choice was Qwen2.5-Math-7B-Instruct. Why, you could ask? First and foremost, it has excellent reasoning, particularly in mathematics and logic exercises, which is useful for the tasks I am about to throw at it.
Secondly, it is relatively small, which makes it manageable in the long term. It also does not eat up too much RAM, so the system does not have to lean on swap.
Once you have settled on a model, you can download it. I knew I needed a version quantised by a well-regarded publisher.

So I went to this address:
https://huggingface.co/bartowski/Qwen2.5-Math-7B-Instruct-GGUF/tree/main
and downloaded this file: Qwen2.5-Math-7B-Instruct-Q4_K_M.gguf

I chose Qwen2.5-Math-7B-Instruct because of its strong performance in mathematical reasoning and logical problem-solving, which aligns well with algorithmic practice tasks. Its relatively small size also makes it suitable for local execution on Apple Silicon machines with 16GB of unified memory, without excessive reliance on disk swapping.

To run it efficiently, I used a quantised GGUF version. Quantisation reduces the precision of model weights (for example, to 4-bit integers), significantly lowering memory usage while maintaining acceptable output quality. In this case, the Q4_K_M variant reduces the model size to approximately 4–5 GB.
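The size reduction can be sanity-checked with simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. The sketch below is only an estimate. The parameter count (about 7.6 billion) and the average of 4.85 bits per weight for a mixed 4-bit scheme are assumptions for illustration, and `estimate_gb` is a hypothetical helper name.

```shell
# Rough size estimate for a quantised model: parameters * bits per weight / 8.
# Both inputs are assumptions for illustration, not measured values.
PARAMS_BILLION=7.6      # assumed parameter count of a 7B-class model, in billions
BITS_PER_WEIGHT=4.85    # assumed average for a mixed 4-bit scheme such as Q4_K_M

estimate_gb() {
  # $1 = parameters in billions, $2 = average bits per weight
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

echo "Estimated 4-bit size:  $(estimate_gb "$PARAMS_BILLION" "$BITS_PER_WEIGHT") GB"
echo "Estimated 16-bit size: $(estimate_gb "$PARAMS_BILLION" 16) GB"
```

With these inputs the estimate comes out at about 4.6 GB for the 4-bit variant, against about 15.2 GB at 16 bits, which is why the unquantised model would leave almost no room on a 16GB machine.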

The goal is to ensure the model fits entirely in unified memory, avoiding swap usage, which would severely degrade inference speed and responsiveness.

So why is that important? Ideally the model should fit in the RAM that is currently free. Because Apple Silicon uses a unified memory architecture, the model has to fit in available memory to avoid swapping to disk. What does this mean in practice? First, you want the model to generate text fast enough for the conversation to flow. Second, you want to avoid the situation where the model overflows fast RAM and spills into swap (the part of your SSD that serves as additional RAM), which is painfully slow and increases wear on your SSD.
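The "does it fit" question can be written down as a small check. This is a minimal sketch with hypothetical numbers: the model size, the headroom reserved for context and runtime buffers, and the free memory figure are all placeholders you would replace with your own readings, and `fits_in_memory` is an illustrative helper, not part of any tool.

```shell
# Check whether a model plus some working headroom fits in a memory budget.
# All numbers below are hypothetical; substitute your own readings.
fits_in_memory() {
  # $1 = model size in GB
  # $2 = headroom in GB for context and runtime buffers
  # $3 = free memory in GB
  awk -v m="$1" -v h="$2" -v f="$3" 'BEGIN {
    if (m + h <= f) print "fits"; else print "would swap"
  }'
}

fits_in_memory 4.7 2 10     # a ~4.7 GB quantised model with 10 GB free
fits_in_memory 15.2 2 10    # a ~15.2 GB unquantised model with 10 GB free
```

The first call prints "fits" and the second prints "would swap". On macOS, `sysctl -n hw.memsize` reports the total physical memory in bytes, which is a starting point for choosing the budget.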

With all this in place, we need to think about how to use our brand new model on the machine. You need an engine to run it. My choice was llama.cpp because I like to tinker with things, and it gives endless possibilities in that respect.

What you will need first is CMake, a tool for building C/C++ programs.
You can easily install it by running: brew install cmake

Then I strongly advise building llama.cpp from source with Metal, Apple's GPU API, enabled. Without Metal, the model runs entirely on the CPU, which significantly reduces performance.

So now create a separate folder and run the commands below inside it:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release

Now the engine is set up and you can give the LLM a go.
To do this, assuming that you stored the model under a models directory, the command should look something like this:
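A minimal sketch of such a command, run from the llama.cpp folder. The binary path build/bin/llama-cli and the model path are assumptions based on the build steps and file name above, and flag behaviour can differ between llama.cpp versions, so it is worth checking `llama-cli --help` on your build.

```shell
# Paths are assumptions: adjust them to where you built llama.cpp and saved the model.
LLAMA_CLI="./build/bin/llama-cli"
MODEL="models/Qwen2.5-Math-7B-Instruct-Q4_K_M.gguf"

if [ -x "$LLAMA_CLI" ] && [ -f "$MODEL" ]; then
  # -m    path to the GGUF model file
  # -ngl  number of layers to offload to the GPU (99 is a common way to ask for all of them)
  # -cnv  interactive conversation mode
  "$LLAMA_CLI" -m "$MODEL" -ngl 99 -cnv
else
  echo "llama-cli or the model file was not found; check the paths above" >&2
fi
```

The existence check is there only so that a wrong path produces a readable message instead of a shell error.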

And this brings us to the final point. You can now talk to the chatbot and have a whole university library in one place! This makes local experimentation much more accessible.

DEV Community
