
Erick Mwangi Muguchia

No GPU? No problem! Running local AI efficiently on my CPU.

1. Why I tried this.

2. My setup.

3. The problems I faced.

4. The tweaks I discovered.

5. Results.

6. Lessons.

THE WHY:

I’ve always wanted to explore deep conversations with AI and understand how these systems work. For a long time, that dream was limited by the lack of a GPU and the high cost of apps that allow meaningful interaction with AI models. Now, I’m determined to overcome those barriers and build my own path into this world.

So the idea of running one locally always kept bugging me, and I finally got to do it.
It was a long and educational journey, so let me walk you through it.

My setup

OS: Arch Linux x86_64
Workflow: tmux + i3 (just because I like using key-bindings), starship + wezterm
Hardware:
CPU: Intel(R) Core(TM) i5-7200U (4) @ 3.10 GHz
GPU: Intel HD Graphics 620 @ 1.00 GHz [Integrated]
Memory: 2.75 GiB / 7.61 GiB (36%)
Swap: 666.95 MiB / 3.81 GiB (17%)

Storage: Disk (/): 28.64 GiB / 31.20 GiB (92%) - ext4
Disk (/home): 39.15 GiB / 84.33 GiB (46%) - ext4
Disk (/run/media/shinigami/Vault): 349.96 GiB / 457.38 GiB (77%) - ext4

Ollama is the engine I used to run models locally. By default, it stores models in ~/.ollama, which quickly filled my root partition.
Install it with:
curl -fsSL https://ollama.com/install.sh | sh

To fix this I redirected the models to a secondary storage.

mv ~/.ollama /run/media/shinigami/Vault/ollama
ln -s /run/media/shinigami/Vault/ollama ~/.ollama

This symlink makes Ollama store everything on the larger drive, solving the disk-full error.
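Before touching the real ~/.ollama, you can rehearse the move-and-symlink pattern on throwaway paths. The sketch below uses a temp directory; substitute the real paths from above when you do it for real:

```shell
# Rehearse the move + symlink on throwaway paths before doing it for real.
demo="$(mktemp -d)"
mkdir -p "$demo/home/.ollama/models" "$demo/vault"

# 1. Move the model store to the big disk.
mv "$demo/home/.ollama" "$demo/vault/ollama"

# 2. Symlink the old location to the new one.
ln -s "$demo/vault/ollama" "$demo/home/.ollama"

# 3. Verify the symlink resolves and the models dir is still reachable.
readlink "$demo/home/.ollama"
test -d "$demo/home/.ollama/models" && echo "symlink OK"

rm -rf "$demo"
```

Alternatively, recent Ollama versions read an OLLAMA_MODELS environment variable pointing at the model store; if you run Ollama as a systemd service, setting that in the unit's environment avoids the symlink entirely.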

Now pulling and managing models.

I started by pulling the smallest model, TinyLlama, which is about 637 MB:
ollama pull tinyllama

After testing it: fast, yes, but not that smart. So I went looking for slightly smarter models.

Therefore I went ahead and pulled larger models, llama3.2:3b and llama3.2:1b, which are 2.0 GB and 1.3 GB respectively.
ollama pull llama3.2:3b
ollama pull llama3.2:1b
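With my root partition already at 92%, I got into the habit of checking free space before pulling anything. Here is a small helper I use; the function name and the 2100 MiB headroom figure are my own choices, not anything Ollama requires:

```shell
# Hypothetical pre-pull check: refuse to pull when the model store's
# filesystem has less free space than the model needs (in MiB).
check_space() {
    dir="$1"; need_mib="$2"
    # df -Pm prints sizes in 1 MiB blocks; column 4 of the data row is "Available".
    avail_mib=$(df -Pm "$dir" | awk 'NR==2 {print $4}')
    if [ "$avail_mib" -ge "$need_mib" ]; then
        echo "OK: ${avail_mib} MiB free, need ${need_mib} MiB"
    else
        echo "NOT ENOUGH SPACE: ${avail_mib} MiB free, need ${need_mib} MiB"
        return 1
    fi
}

# llama3.2:3b is ~2.0 GB, so ask for ~2100 MiB of headroom:
#   check_space "$HOME/.ollama" 2100 && ollama pull llama3.2:3b
check_space "$HOME" 1
```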

Building custom models
I wanted to get the maximum use out of the models, so I created custom versions using Modelfiles.
Example:
I wanted one to help me learn basic networking concepts; its Modelfile looks like this:


# Adjust the base model accordingly.
FROM llama3.2:3b

# ---------- PARAMETERS ----------
# Lower temperature for accuracy; moderate top_p for variety without drifting.
PARAMETER temperature 0.25
PARAMETER top_p 0.9

# More room for code + explanations.
PARAMETER num_ctx 4096

# Safer repetition control (prevents looping).
PARAMETER repeat_penalty 1.12

# If your setup supports it and you want more deterministic answers, you can also try:
# PARAMETER seed 42

# ---------- SYSTEM BEHAVIOR ----------
SYSTEM """
You are My Mentor: a concise, practical networking fundamentals tutor and coding assistant.

Primary goal:
- Teach networking fundamentals clearly and correctly (OSI/TCP-IP, IP addressing & subnetting, ARP, DNS, DHCP, TCP vs UDP, ports/sockets, routing, NAT, HTTP/TLS basics, troubleshooting with ping/traceroute/nslookup/curl/tcpdump, basic firewalls).

Secondary goal:
- Produce meaningful, runnable code examples when useful; prefer (enter preferred language).

Style rules:
- Explain concepts with short definitions + one concrete example.
- When writing code, include: what it does, how to run it, expected output, and common pitfalls.
- If the user’s question is ambiguous, ask up to 2 clarifying questions before answering.

Output format (default):
1) Concept (2–5 sentences)
2) Why it matters (1–2 bullets)
3) Example (diagram, packet flow, or command)
4) Code (only if it adds value; keep it minimal)
5) Quick check (2–3 questions to self-test)
""" 

That snippet explains much of it.

Building the models

After creating the Modelfile (e.g. vim Modelfile1), we bind it to any of the models we downloaded.
A Modelfile is the blueprint that shapes an AI model’s personality, rules, and behavior on top of its base intelligence.

Create a model from the Modelfile with:
ollama create My-model -f Modelfile1

you should see this output:

gathering model components 
using existing layer sha256:dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff 
using existing layer sha256:966de95ca8a62200913e3f8bfbf84c8494536f1b94b49166851e76644e966396 
using existing layer sha256:fcc5a6bec9daf9b561a68827b67ab6088e1dba9d1fa2a50d7bbcc8384e0a265d 
using existing layer sha256:a70ff7e570d97baaf4e62ac6e6ad9975e04caa6d900d3742d37698494479e0cd 
creating new layer sha256:7f89bd8bf6ef609a9aefeab288cde09db6c1ef97f649691f25b29e0f85a8c91c 
creating new layer sha256:446b3a23f7599dc79a11cfb03c670091c9fe265aba28fa3316e9e46dc86365db 
writing manifest 
success 


"My-model" > you can name it anything.
You can also create as many Modelfiles as you like, giving each a different task, and of course you can add more rules and examples to each Modelfile as you like.
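If you accumulate several Modelfiles, a tiny loop can (re)build them all. This is just a sketch: it assumes files named Modelfile1, Modelfile2, … and derives a model name from each file name (both conventions are mine, not Ollama's).

```shell
# Sketch: print the build command for every Modelfile in the current
# directory. The naming scheme (Modelfile1 -> my-model-1) is my own.
build_all() {
    for f in Modelfile*; do
        [ -e "$f" ] || continue            # glob matched nothing: skip
        n="${f#Modelfile}"                 # strip the "Modelfile" prefix
        echo "ollama create my-model-${n:-0} -f $f"   # drop 'echo' to actually build
    done
}
build_all
```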

After successfully creating 'My-model', run it:
ollama run My-model

➜ ollama run My-model
>>> Send a message (/? for help)


That was the fun part. The real problem came once a model was running: the CPU starts screaming at 100% usage and overheating, which ends up slowing the model down.

So I had to optimize my setup to better handle the models.
Using tools to monitor CPU usage (htop) and temperature (lm_sensors), I was able to better optimize my setup.
To run the models efficiently I had to:

Maximize CPU performance when running the models.
Reduce latency of bottlenecks.
Stabilize thermal behavior.
Prioritize compute-heavy processes.
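htop shows the load interactively; when I wanted a single number to log, I sampled /proc/stat directly. A Linux-only sketch (it ignores iowait/irq time for simplicity, so it slightly undercounts):

```shell
# Linux-only: overall CPU utilisation (%) over a 1-second window, from
# two snapshots of the aggregate "cpu" line in /proc/stat.
cpu_busy_pct() {
    read -r _ u1 n1 s1 i1 rest < /proc/stat   # user nice system idle ...
    sleep 1
    read -r _ u2 n2 s2 i2 rest < /proc/stat
    busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
    idle=$(( i2 - i1 ))
    total=$(( busy + idle ))
    [ "$total" -gt 0 ] && echo $(( 100 * busy / total ))
}
cpu_busy_pct   # near 0 on an idle box, near 100 while a model is generating
```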

Running local AI models on CPU introduces:

Lower parallelism.
Thermal throttling.
OS scheduling inefficiencies.
Power-saving defaults limiting performance.

So instead of forcing the model, you optimize around it.
Unlock CPU performance
On Arch:
sudo pacman -S cpupower
Then install tuned for changing the CPU frequency state and for system-wide optimization:
sudo pacman -S tuned
Then enable it:
sudo systemctl enable --now tuned

Then set the CPU governor to performance:
sudo cpupower frequency-set -g performance

Switches CPU governor: By default, Linux CPUs often run in “ondemand” or “powersave” mode, scaling frequency up and down depending on load.
Performance mode: Locks the CPU to its maximum frequency, ensuring consistent speed.

Impact:
Faster response times for heavy workloads (like tokenization, AI inference, or compiling).
Reduced latency spikes since the CPU doesn’t waste time ramping up.
More predictable benchmarking results.

Cons:
Higher power draw, more heat, fans spin up, battery drains faster on laptops.

Then confirm the configuration:
cpupower frequency-info

➜ cpupower frequency-info
driver: acpi-cpufreq
hardware limits: 400 MHz - 2.60 GHz
available cpufreq governors: conservative ondemand userspace powersave performance schedutil
current policy: governor "performance" within 400 MHz - 2.50 GHz
current CPU frequency: 3.10 GHz (kernel reported)
boost state: Supported, Active

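As a cross-check, the same governor information cpupower reports can be read straight from sysfs. This sketch prints one line per core, or a notice if the cpufreq interface isn't exposed (e.g. inside a VM):

```shell
# Read the active governor for each core straight from sysfs.
show_governors() {
    for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        [ -r "$f" ] || { echo "cpufreq sysfs not available"; return; }
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}
show_governors
```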

Then apply the throughput performance profile, using the tuned we installed earlier.
sudo tuned-adm profile throughput-performance

Why this matters:

Optimizes CPU behaviors.
Improves disk I/O.
Adjusts system scheduling.
Reduces unnecessary power-saving interruptions.

Results: Smoother, sustained compute performance.

Models I’m using (and why)

  1. llama3.2:3b

    Balanced size and capability.
    Noticeably smart.
    Good for deeper prompts and reasoning.
    This felt like middle ground between speed and intelligence.

  2. phi3:mini

    Very efficient for its size.
    Strong reasoning compared to other small models.
Optimized for lower-resource environments.
    This one stood out as surprisingly powerful for CPU use.

This concludes my first phase: setting up, tuning performance, and confirming that local AI models run smoothly. In the next phase, I’ll dive into measuring tokenization speed, using verbose logs and custom C scripts to compare how these models perform under different workloads.
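As a small teaser for that phase: ollama run <model> --verbose prints an "eval count" (tokens generated) and an "eval duration" after each reply, and tokens per second is just their ratio. The sample numbers below are made up for illustration:

```shell
# tokens/sec = eval count / eval duration. Sample values are made up;
# substitute the numbers --verbose reports for your own runs.
eval_count=256        # tokens generated, from "eval count"
eval_duration=12.8    # seconds, from "eval duration"
awk -v c="$eval_count" -v d="$eval_duration" \
    'BEGIN { printf "%.1f tokens/sec\n", c / d }'
# prints: 20.0 tokens/sec
```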
"Turns out you don't need powerful hardware to explore AI, just curiosity and a stubborn CPU."
