
Abdullah Sheikh

How to Run AI Models Locally Without a GPU: A Complete Step‑by‑Step Guide

Learn how to set up, optimize, and execute popular AI models on a CPU‑only machine in just a few hours

Before We Start: What You'll Walk Away With

By the end of this guide you’ll be able to fire up a modern AI model on a laptop that only has a CPU, just like ordering a take‑out meal without waiting for the chef’s special grill.

First you’ll know exactly which hardware specs and operating‑system settings are sufficient for CPU‑only inference. Think of it as checking your suitcase weight before a flight: you’ll avoid the surprise “too heavy” notice at the gate.

Next you’ll install the right Python packages and model‑optimizers, then configure them so they talk to each other without hiccups. It’s similar to setting up a GPS: you input the destination, and the software finds the fastest route.

Finally you’ll run a real‑world model, time how long it takes, and have a checklist for the most common glitches. This is like watching a timer while you bake a cake, so you know exactly when to pull it out.

  • Identify the CPU cores, RAM, and OS version needed to run AI models locally without a GPU.

  • Install torch, transformers, and optimum (or similar) and tweak settings for CPU execution.

  • Load a sample model, benchmark inference speed, and apply quick fixes for memory or slowdown issues.

  • Hardware check: 4+ CPU cores, 8 GB RAM, recent Linux/macOS/Windows build.

  • Python env: Use a virtual environment to keep dependencies tidy.

  • Optimization tip: Call torch.set_num_threads() with your core count.
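The hardware check in the list above can be scripted. Here is a minimal sketch using only the standard library; the helper names (total_ram_gb, meets_minimum) and the 7.5 GiB threshold are assumptions made for illustration, and the RAM lookup relies on os.sysconf, which exists on Linux and macOS but not on Windows.

```python
import os
import platform


def total_ram_gb():
    """Return total physical RAM in GiB, or None if it cannot be determined."""
    try:
        # Available on Linux and macOS; Windows has no os.sysconf
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None
    if page_size <= 0 or page_count <= 0:
        return None
    return page_size * page_count / 1024**3


def meets_minimum(cores, ram_gb, min_cores=4, min_ram_gb=7.5):
    """Apply the 4+ cores / 8 GB rule of thumb.

    The RAM threshold is 7.5 because a machine sold as "8 GB" usually
    reports slightly less than 8 GiB to the operating system.
    """
    if cores is None or ram_gb is None:
        return False
    return cores >= min_cores and ram_gb >= min_ram_gb


cores = os.cpu_count()  # logical cores; may be None on unusual platforms
ram = total_ram_gb()
print(f"OS: {platform.system()} {platform.release()}")
print(f"Logical cores: {cores}")
print(f"RAM: {ram:.1f} GiB" if ram is not None else "RAM: unknown")
print("Meets the guide's minimum:", meets_minimum(cores, ram))
```

If the RAM line prints "unknown" (for example on Windows), check the figure in your system settings instead.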

Ready to start? Let’s get the environment set up so you can dive straight into model testing.

What Running AI Models Locally Without a GPU Actually Is (No Jargon)

Running AI models locally without a GPU means you take a pre‑trained neural network and let your computer’s CPU do the heavy lifting instead of a graphics card. The CPU isn’t built for the massive parallel math that GPUs excel at, so you’ll notice slower response times, but clever software settings—like reducing precision or batching smaller inputs—keep the lag from becoming unbearable.

Think of it like driving a regular sedan on a highway where most cars are race cars. The sedan (your CPU) can still reach the destination, but it won’t overtake anyone. If you tune the engine a bit (choose a smoother route, keep the speed steady, and avoid sudden accelerations), you’ll arrive without breaking down, just a few minutes later than the race cars.
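To make "reducing precision" concrete, here is a small sketch of 8-bit quantization in plain Python. It uses a simplified symmetric scheme (one shared scale, no zero point), which is an assumption made for readability; real frameworks use more elaborate variants. The weight values are invented for the example.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights, purely for illustration
weights = [0.823, -0.417, 0.052, -1.27, 0.336]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("largest rounding error:", max_error)
```

Each weight now fits in one byte instead of four, and the rounding error stays below half the scale. That small loss of accuracy is the price of the speed and memory savings.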

The 3 Mistakes Everyone Makes With Running AI Models on CPU

Most CPU‑only attempts end in frustration because they fall into three classic traps.

  • Assuming the default install is the right one. On Linux, a plain pip install torch typically pulls in the CUDA‑enabled build, a much larger download full of GPU libraries you’ll never use. Install the CPU‑only wheel instead (e.g., torch==2.0.0+cpu), or tensorflow-cpu if you prefer TensorFlow. The CPU wheel mainly saves disk space and install time; the real speedups come from the next two fixes.

  • Skipping quantization and pruning. Think of a model as a suitcase packed with clothes. Quantization folds the fabric tighter, while pruning removes the bulk you never wear. Applying torch.quantization.quantize_dynamic in PyTorch, or setting converter.optimizations = [tf.lite.Optimize.DEFAULT] when converting to TensorFlow Lite, can cut inference time substantially, turning a sluggish stroll into a quick dash. Results vary widely by model and CPU, so benchmark before and after.

  • Neglecting OS‑level settings. Your CPU is like a car; without the right fuel (BLAS libraries) it sputters. Installing openblas or intel-mkl, then setting OMP_NUM_THREADS to match your core count, can raise throughput considerably. On AMD CPUs with older MKL releases, export MKL_DEBUG_CPU_TYPE=5 was a common way to force the AVX2 code path; recent MKL versions ignore the variable.

Fix these three, and you’ll finally get a usable experience running AI models locally without a GPU.
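Quantization gets most of the attention, so here is a rough sketch of the other half of the suitcase analogy: pruning. It zeroes the weights with the smallest absolute values (magnitude pruning) in plain Python. The function name and the sample weights are made up for illustration; in PyTorch the torch.nn.utils.prune module provides this kind of operation on real layers.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    n_prune = int(len(weights) * fraction)
    # Indices ordered from smallest to largest magnitude
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Made-up weights, purely for illustration
weights = [0.9, -0.02, 0.4, 0.01, -0.7, 0.05]
pruned = prune_by_magnitude(weights, 0.5)
print(pruned)
```

Keep in mind that zeroed weights only make inference faster when the runtime can skip them, so treat pruning as something to measure rather than a guaranteed win.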

How to Run AI Models Locally Without a GPU: Step‑By‑Step

First, make sure your CPU can actually handle the heavy lifting.

  • Check for AVX2/AVX‑512 support (lscpu on Linux) and install the math libraries your framework needs, e.g. sudo apt-get install libopenblas-dev. Think of this like confirming your kitchen has a stove before you start cooking.
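If you’d rather check from Python than read lscpu output by eye, here is a minimal sketch that parses /proc/cpuinfo. It is Linux‑only (the file does not exist on macOS or Windows), and the helper names are assumptions made for this example.

```python
def parse_cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 kernels label the line "flags"; ARM kernels use "Features"
        if key.strip().lower() in ("flags", "features"):
            flags.update(value.split())
    return flags


def simd_support(flags):
    """Report whether AVX2 and any AVX-512 subset are present."""
    return {
        "avx2": "avx2" in flags,
        "avx512": any(f.startswith("avx512") for f in flags),
    }


def read_cpuinfo(path="/proc/cpuinfo"):
    """Return the file contents, or an empty string when it is unavailable."""
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except OSError:
        return ""


print(simd_support(parse_cpu_flags(read_cpuinfo())))
```

On a machine without /proc/cpuinfo, both values print as False, which means "unknown" rather than "unsupported".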

Create an isolated Python environment and pull in a CPU‑only deep‑learning stack:

python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install torch==2.2.0 --index-url https://download.pytorch.org/whl/cpu
pip install transformers
# or, for TensorFlow: pip install tensorflow-cpu

This keeps your “ingredients” from mixing with other projects.

Grab a model that runs comfortably on a CPU. For example, download distilbert-base-uncased from Hugging Face:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

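Compress the model with post‑training quantization. Using Optimum’s Intel Neural Compressor integration (installed with pip install "optimum[neural-compressor]"; argument names have changed between releases, so check the version you have):

from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer

# Dynamic quantization needs no calibration data; the result is saved to ./quantized
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(quantization_config=quantization_config, save_directory="quantized")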
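
To get a feel for what quantization buys you in memory, here is a back‑of‑the‑envelope sketch. It assumes roughly 66 million parameters for DistilBERT (an approximate figure) and counts weights only, so activations and runtime overhead are ignored. Dynamic quantization typically converts only some layers, so treat the int8 row as a best case.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_size_mb(num_params, dtype):
    """Estimate the size of the weights alone, in megabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Approximate parameter count, used here as an assumption
distilbert_params = 66_000_000

for dtype in BYTES_PER_PARAM:
    size = weights_size_mb(distilbert_params, dtype)
    print(f"{dtype:>8}: {size:.0f} MB")
```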

This is like ordering a smaller pizza that still tastes great.

Tell the CPU how many threads to use:

export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
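
The same thread settings can be applied from inside a Python script, as in this minimal sketch. The helper names and the cap of 8 are assumptions for illustration. Note that os.cpu_count() reports logical cores, which on CPUs with hyper‑threading is often double the physical count, and that the variables should be set before importing torch so the math libraries see them when they start up.

```python
import os


def pick_thread_count(logical_cores, cap=8):
    """Choose a thread count between 1 and `cap`, never above the core count."""
    if not logical_cores or logical_cores < 1:
        return 1
    return max(1, min(logical_cores, cap))


def thread_env(n_threads):
    """Build the environment variables read by OpenMP and MKL."""
    return {
        "OMP_NUM_THREADS": str(n_threads),
        "MKL_NUM_THREADS": str(n_threads),
    }


n_threads = pick_thread_count(os.cpu_count())
# Set these before importing torch or numpy
os.environ.update(thread_env(n_threads))
print("Using", n_threads, "threads")
```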

It’s similar to setting the number of cashiers in a grocery line for optimal flow.

Run a quick inference test and note the latency:

import time, torch
inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    model(**inputs)  # warm-up run so one-time setup cost is not timed
    start = time.perf_counter()
    outputs = model(**inputs)
print(f"Latency: {(time.perf_counter() - start) * 1000:.1f} ms")
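
A single timing can be noisy, so it helps to repeat the measurement. Here is a small, framework‑agnostic sketch of a benchmark helper: it warms up, times several runs, and reports the median, the fastest run, and a rough 95th percentile. The function name and defaults are assumptions, and the workload shown is a stand‑in, not a model.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Time fn() repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not measured
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    # Rough 95th percentile: the sample at the 95% position, clamped to the last one
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "min_ms": samples[0],
    }


# Stand-in workload; replace the lambda with your own inference call
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

With the model from this guide you could pass something like lambda: model(**inputs) while inside a torch.no_grad() block.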

Optional: Convert the model to ONNX and use ONNX Runtime, which often gives a further speed boost. Install the packages first:

pip install onnx onnxruntime

Then export and run the model in the same Python session, so that model and inputs are still defined:

import torch
import onnxruntime as ort
torch.onnx.export(model, (inputs["input_ids"],), "model.onnx",
                  input_names=["input_ids"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}})
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
print(sess.run(None, {"input_ids": inputs["input_ids"].numpy()}))

Now you have a repeatable, GPU‑free workflow ready for daily experiments.

A Real Example: Running a Sentiment Analyzer on a Laptop

Maya opens a terminal on her 2022 MacBook Air and gets the model ready in minutes.

  • Install the required libraries:
pip install torch==2.0.1 transformers==4.35.0

  • Download the sentiment model and quantize it to 8‑bit:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# 8‑bit dynamic quantization of the Linear layers (like packing a suitcase tighter)
# If PyTorch reports that no quantized engine is available on Apple silicon,
# set torch.backends.quantized.engine = "qnnpack" before this call.
model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
model.eval()
  • Set the thread count in your shell before launching Python, then run a quick test sentence:
export OMP_NUM_THREADS=8

import torch, time

def predict(text):
    inputs = tokenizer(text, return_tensors="pt")
    start = time.time()
    with torch.no_grad():
        logits = model(**inputs).logits
    pred = logits.argmax().item()
    return "positive" if pred == 1 else "negative", (time.time() - start)*1000

sentence = "I love the new feature in our product!"
label, latency = predict(sentence)
print(f"Sentiment: {label}, latency: {latency:.1f} ms")

On Maya’s MacBook Air the script prints something like Sentiment: positive, latency: 148.3 ms, a drop from the ~1.2 s you’d see without quantization and thread tuning. She can now run AI models locally, without a GPU, fast enough to experiment during lunch breaks.

  • Tip: Keep OMP_NUM_THREADS between 4–8 on a laptop; setting it higher than your core count makes threads compete for the same cores and usually slows inference down.

  • Tip: Downloaded weights are cached automatically under ~/.cache/huggingface, so the model is fetched only once; quantization is then re‑applied each time the script starts.
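
The predict function above returns only a label. If you also want a confidence score, you can pass the logits through a softmax, as in this plain‑Python sketch. The label order (index 0 negative, index 1 positive) follows the script above, and the example logits are invented.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits, labels=("negative", "positive")):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits, purely for illustration
label, confidence = label_with_confidence([-2.1, 3.4])
print(f"{label} ({confidence:.1%})")
```

With the real model, the list of logits could come from something like model(**inputs).logits[0].tolist().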

The Tools That Make This Easier

First, create an isolated workspace so your CPU‑only setup doesn’t clash with other projects.

  • Python virtual environments – think of it as packing a separate suitcase for each experiment. Use python -m venv env or conda create -n cpu-env python=3.10 and activate it before installing anything.

  • PyTorch CPU‑only wheels – the “no‑GPU” version of the library. Install with a single command that points to the right index, just like ordering a specific dish from a menu:

pip install torch --index-url https://download.pytorch.org/whl/cpu
  • Optimum (Hugging Face) – a Swiss‑army knife for quantization and ONNX export. It streamlines the steps you’d otherwise repeat manually, similar to how Google Maps suggests the fastest route without you having to plot each turn.

  • ONNX Runtime (CPU execution provider) – the high‑performance inference engine that runs the exported model. It’s like a well‑tuned engine that lets your car (the model) cruise efficiently on a CPU‑only road.

  • Intel® Extension for PyTorch – optional but valuable if your CPU supports AVX‑512. It adds a turbo boost, comparable to adding a performance chip to a standard engine.

Putting these tools together creates a smooth pipeline: set up a venv, pull the CPU‑only PyTorch wheel, use Optimum to quantize and export to ONNX, then fire it up with ONNX Runtime. The Intel extension can be dropped in for an extra speed bump.

With this toolbox, you can run AI models locally without a GPU and keep your workflow tidy.
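
Before you start, it can be handy to see which of these tools are already installed in the active environment. Here is a minimal sketch using the standard library; the function name is an assumption, and the names listed are the import names of the packages (for example, Intel Extension for PyTorch is imported as intel_extension_for_pytorch).

```python
import importlib.util

# Import names of the tools discussed in this section
TOOLBOX = (
    "torch",
    "transformers",
    "optimum",
    "onnxruntime",
    "intel_extension_for_pytorch",
)


def installed_tools(names=TOOLBOX):
    """Map each module name to True if it can be imported from this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, present in installed_tools().items():
    print(f"{name:<30} {'installed' if present else 'missing'}")
```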

Quick Reference: Run AI Models Locally Without a GPU Cheat Sheet

Think of this as a pocket‑sized checklist you can keep open while you set up your CPU‑only AI workspace.

  • ✔️ Verify CPU capabilities – run lscpu (Linux) or check System Info (Windows) for AVX2 or AVX‑512. It’s like confirming a car has a manual transmission before you try to drive it.

  • ✔️ Create a clean virtual environment:

python -m venv .venv && source .venv/bin/activate

then install the CPU‑only build:

pip install torch==2.2.0 --index-url https://download.pytorch.org/whl/cpu

or pip install tensorflow-cpu. A fresh venv prevents version clashes.

  • ✔️ Grab your model and quantize – download the model with transformers, then apply 8‑bit dynamic quantization with PyTorch (Optimum offers a comparable workflow):

import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

Think of quantization as packing a suitcase tighter so you can fit more clothes (parameters) in the same bag (RAM).

  • ✔️ Set threading environment variables – tell the libs how many cores to use:

export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4

It’s like telling a kitchen how many chefs can work simultaneously.

  • ✔️ Run a quick sanity check – execute a short script that does a forward pass and prints latency. Example persona: Alice runs time python test.py and notes the milliseconds per token.

  • ✔️ Optional speed boost – export to ONNX and serve with ONNX Runtime:

torch.onnx.export(model, dummy_input, "model.onnx")

Then pip install onnxruntime and run. Treat ONNX like a Google Maps shortcut that skips the scenic route.

  • ✔️ Watch memory usage – stay under your RAM limit; only switch to torch.float16 if the CPU reports half‑precision support. It’s like using a smaller backpack when the luggage compartment is tight.

  • 💡 Tip: Check torch.backends.quantized.engine to confirm a quantized backend is active (typically fbgemm or x86 on Intel/AMD CPUs, qnnpack on ARM).

  • 💡 Tip: Pin the process to a fixed set of cores with taskset on Linux (e.g., taskset -c 0-3 python test.py) if you see jitter.

  • 💡 Tip: Log psutil.virtual_memory() before and after loading the model to spot leaks.

With this cheat sheet, you can confidently run AI models locally without a GPU and avoid the usual roadblocks.

What to Do Next

Start small, then build up – here’s a three‑step ladder you can climb right now.

  • Run the sentiment‑analysis example on your own text. Grab a paragraph from a recent email or a news article and feed it to the script you set up. Think of it like ordering a single dish at a new restaurant to see if you like the kitchen.

  • Quantize a larger model, such as bert-base-uncased, and measure latency. Use torch.quantization.quantize_dynamic to shrink the model, then time a few inference calls. It’s similar to packing a suitcase more tightly – you fit more in the same space, but you have to check that everything still fits comfortably.

  • Build a tiny Flask API that serves the quantized model locally. Create an endpoint, load the model once at startup, and return predictions for incoming JSON payloads. This mirrors setting up a personal Google Maps server: you’ve got a local route planner that works without needing the cloud.

  • Tip: Keep a requirements.txt handy so you can reinstall the same environment on another machine.

  • Cheat sheet: python -m timeit -s "import torch; m=...; inp=..." "m(inp)" quickly shows speed gains.
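
As a starting point for the third step, here is a rough sketch of a local prediction endpoint. To keep it dependency‑free it uses Python’s built‑in http.server in place of Flask, and predict is a keyword‑matching placeholder rather than the real model; both are stand‑ins for illustration. The port, function names, and JSON shape are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(text):
    """Placeholder for the real model call; swap in your own predict()."""
    positive_words = {"love", "great", "good"}
    hit = any(w.strip(".,!?").lower() in positive_words for w in text.split())
    return "positive" if hit else "negative"


def handle_payload(raw_body):
    """Turn a JSON request body into a (status_code, response_dict) pair."""
    try:
        text = json.loads(raw_body)["text"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": 'expected JSON like {"text": "..."}'}
    if not isinstance(text, str):
        return 400, {"error": '"text" must be a string'}
    return 200, {"label": predict(text)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host="127.0.0.1", port=8000):
    """Start the server; this call blocks until interrupted."""
    HTTPServer((host, port), PredictHandler).serve_forever()
```

Call serve() to start it, then send a POST request with a body such as {"text": "I love it!"} to http://127.0.0.1:8000. Load the real model once at module level, not inside the handler, so each request only pays for inference.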

💬 Got stuck or discovered a faster trick? Drop a comment below – I’d love to hear your experience!



About the Author

Abdullah Sheikh is the Founder & CEO at Exteed, where he leads a team of skilled developers specializing in Web2 and Web3 applications, Custom Smart Contracts, and Blockchain solutions.

With 6+ years of experience, Abdullah has built CRMs, Crypto Wallets, DeFi Exchanges, E-Commerce Stores, HIPAA Compliant EMR Systems, and AI-powered systems that drive business efficiency and innovation.

His expertise spans Blockchain, Crypto & Tokenomics, Artificial Intelligence, and Web Applications; building reliable and smooth web apps that fit the client’s goals and requirements.

📧 info@abdullah-sheikh.com · 🔗 LinkedIn · 🌐 abdullah-sheikh.com
