Hossein Mortazavi

Running Microsoft's Phi-3 on CPU with Rust & Candle

Python is currently the best tool for training machine learning models. With the tools available in its ecosystem, such as PyTorch and Hugging Face Transformers, it has never been easier to build a proof of concept for an AI model. We are big fans of Python because of how easy and flexible it is.

However, deployment is a different story.

When it comes to inference (running a trained model) in a production environment, or on devices much smaller than a workstation, Python starts to show its disadvantages.

If you've tried deploying a PyTorch application before, you already know the pain points. You have to build a multi-gigabyte Docker image just to run a simple Python script, cold starts are slow while the interpreter and its libraries load, and the memory these stacks demand puts them out of reach of most consumer-level CPUs and edge devices.

Is it possible to benefit from all of the advantages of modern large language models (LLMs) and deploy them with significantly less overhead?

This is where Rust and Candle come into play.

Candle is a minimalist machine learning (ML) framework created by Hugging Face. It lets you run state-of-the-art AI models without depending on Python at all. When we combine Rust's memory safety and speed with Microsoft's Phi-3 (a capable small language model), we get a very different performance and deployment profile.
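To get a feel for how small the API surface is, here is candle's tensor hello-world (adapted from the project README); it runs on a plain CPU with nothing but the candle-core crate:

use candle_core::{Device, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let device = Device::Cpu;

    // Two random matrices multiplied on the CPU: no Python interpreter, no runtime to ship.
    let a = Tensor::randn(0f32, 1., (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1., (3, 4), &device)?;
    let c = a.matmul(&b)?;

    println!("{c}");
    Ok(())
}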

In this guide, I will show you how to:

Ditch the heavy PyTorch dependency.

Load a quantized Phi-3 model directly in Rust.

Build a standalone, lightweight CLI tool that runs blazing-fast inference on a standard CPU.

No GPU required. No 5GB Docker images. Just pure, high-performance Rust.

Let's dive in.

**Step 1: Setting Up the Project**

First, let's create a new Rust project. Open your terminal and run:
cargo new rust-phi3-cpu
cd rust-phi3-cpu

Next, we need to add the Candle stack to our Cargo.toml. Since we are focusing on CPU inference, we will lean on candle's quantized (GGUF) model support to keep memory usage low and inference fast.

Open Cargo.toml and add the following dependencies:

[package]
name = "rust-phi3-cpu"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1.0"
tokenizers = "0.19.1"
clap = { version = "4.4", features = ["derive"] }

candle-core = { git = "https://github.com/huggingface/candle.git", branch = "main" }
candle-transformers = { git = "https://github.com/huggingface/candle.git", branch = "main" }
candle-nn = { git = "https://github.com/huggingface/candle.git", branch = "main" }

We are using candle-transformers, which has built-in support for GGUF (quantized) models. This is the secret sauce for running heavy models on a CPU efficiently.
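As a quick sanity check, you can open a .gguf file with this same reader and print what it contains before wiring it into a full program. Here is a minimal sketch (the field names follow candle's own quantized examples):

use candle_core::quantized::gguf_file;

fn main() -> anyhow::Result<()> {
    // Open the quantized model file and read its header (metadata + tensor index).
    let mut file = std::fs::File::open("Phi-3-mini-4k-instruct-q4.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;

    println!("metadata entries: {}", content.metadata.len());
    println!("tensors:          {}", content.tensor_infos.len());

    // Peek at a few tensors to see their shapes and quantized dtypes.
    for (name, info) in content.tensor_infos.iter().take(5) {
        println!("{name}: shape {:?}, dtype {:?}", info.shape, info.ggml_dtype);
    }
    Ok(())
}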

**Step 2: The Implementation**

Now, open src/main.rs. We are going to build a simple CLI that accepts a prompt and generates text.

The logic is straightforward:

Load the Model: Read the .gguf file (Phi-3 Mini).

Tokenize: Convert the user's prompt into numbers (tokens).

Inference Loop: Feed tokens into the model one by one to generate the next word.

Here is the full main.rs:

use anyhow::{Error as E, Result};
use clap::Parser;
use candle_transformers::models::quantized_phi3 as model; // Phi-3 specific quantized module
use candle_core::{Tensor, Device};
use candle_core::quantized::gguf_file;
use tokenizers::Tokenizer;
use std::io::Write;

#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Args {
    /// The prompt to run inference with
    #[arg(short, long, default_value = "Physics is fun. Explain quantum physics to a 5-year-old in simple words:")]
    prompt: String,

    /// Path to the GGUF model file
    #[arg(long, default_value = "Phi-3-mini-4k-instruct-q4.gguf")]
    model_path: String,
}

fn main() -> Result<()> {
    let args = Args::parse();

    println!("Loading model from: {}", args.model_path);

    // 1. Setup Device (CPU)
    let device = Device::Cpu;

    // 2. Load the GGUF Model
    let mut file = std::fs::File::open(&args.model_path)
        .map_err(|_| E::msg(format!("Could not find model file at {}. Did you download it?", args.model_path)))?;

    // Read the GGUF header (metadata + tensor index)
    let content = gguf_file::Content::read(&mut file)?;

    // Load the model weights (flash attention disabled since we are on CPU)
    let mut model = model::ModelWeights::from_gguf(false, content, &mut file, &device)?;

    // 3. Load Tokenizer
    println!("Loading tokenizer...");
    let tokenizer = Tokenizer::from_file("tokenizer.json").map_err(E::msg)?;

    // 4. Encode Prompt
    let tokens = tokenizer.encode(args.prompt, true).map_err(E::msg)?;
    let prompt_tokens = tokens.get_ids();
    let mut all_tokens = prompt_tokens.to_vec();

    // 5. Inference Loop
    println!("Generating response...\n");
    let to_generate = 100; // Generate at most 100 tokens
    let mut logits_processor = candle_transformers::generation::LogitsProcessor::new(299792458, None, None);

    print!("Response: ");
    std::io::stdout().flush()?;

    // Phi-3 instruct signals the end of its answer with the <|end|> token.
    let eos_token = tokenizer.token_to_id("<|end|>");

    // Run the whole prompt through the model in a single forward pass. This fills
    // the KV cache and gives us the logits from which we sample the first new token.
    let input = Tensor::new(prompt_tokens, &device)?.unsqueeze(0)?;
    let logits = model.forward(&input, 0)?.squeeze(0)?;
    let mut next_token = logits_processor.sample(&logits)?;

    for index in 0..to_generate {
        all_tokens.push(next_token);

        // Stop early if the model says it is done.
        if Some(next_token) == eos_token {
            break;
        }

        // Print the token as soon as we have it (token-by-token decoding is a bit
        // rough on spacing, but it is fine for a demo).
        if let Ok(t) = tokenizer.decode(&[next_token], true) {
            print!("{}", t);
            std::io::stdout().flush()?;
        }

        // Feed only the new token; the offset tells the model where it sits
        // relative to everything already stored in the KV cache.
        let input = Tensor::new(&[next_token], &device)?.unsqueeze(0)?;
        let logits = model.forward(&input, prompt_tokens.len() + index)?.squeeze(0)?;
        next_token = logits_processor.sample(&logits)?;
    }

    println!("\n\nDone!");
    Ok(())
}
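A practical note on prompting: Phi-3-mini-4k-instruct is a chat-tuned model, so it answers noticeably better when the prompt follows the chat template from its model card instead of being passed in raw. A tiny helper like the hypothetical one below, applied to args.prompt before the encode call in step 4, is enough:

/// Wrap a raw user prompt in Phi-3's instruct chat template
/// (format as documented on the model card: <|user|> ... <|end|> <|assistant|>).
fn format_prompt(user_prompt: &str) -> String {
    format!("<|user|>\n{user_prompt}<|end|>\n<|assistant|>\n")
}

With the template in place, the <|end|> stop check in the generation loop also triggers reliably, because the model finishes its turn with that token.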

**Step 3: Getting the Model Weights**

Before we run this, we need the actual brain of the AI: the model weights. Since we are optimizing for CPU, we will use the quantized GGUF format.

You can download the quantized Phi-3-mini-4k-instruct weights from Hugging Face, or pull them down programmatically, as sketched right after the file list below. Look for the 4-bit (q4) GGUF file, which is roughly 2.3 GB.

Model: Phi-3-mini-4k-instruct-q4.gguf

Tokenizer: tokenizer.json (from the official repo)
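If you would rather fetch these files from code than click through the website, the hf-hub crate (the one candle's own examples use for downloads) can pull them for you. This is only a sketch: it assumes you add hf-hub = "0.3" to your dependencies, and that the file names still match what the microsoft/Phi-3-mini-4k-instruct-gguf and microsoft/Phi-3-mini-4k-instruct repos publish.

use hf_hub::api::sync::Api;

fn main() -> anyhow::Result<()> {
    let api = Api::new()?;

    // Quantized GGUF weights (cached locally in the Hugging Face hub cache).
    let weights = api
        .model("microsoft/Phi-3-mini-4k-instruct-gguf".to_string())
        .get("Phi-3-mini-4k-instruct-q4.gguf")?;

    // tokenizer.json lives in the original (non-GGUF) model repo.
    let tokenizer = api
        .model("microsoft/Phi-3-mini-4k-instruct".to_string())
        .get("tokenizer.json")?;

    println!("weights:   {}", weights.display());
    println!("tokenizer: {}", tokenizer.display());
    Ok(())
}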

**Step 4: Running the Demo**

This is the moment of truth.

Before running, make sure to compile in release mode. Rust’s debug builds are optimized for debugging, not speed. For AI inference, the difference is massive (often 10x-100x slower in debug mode).

Run the following command in your terminal:
cargo run --release -- --model-path "Phi-3-mini-4k-instruct-q4.gguf" --prompt "whatever you want:"

What happens next? There is no long delay while a Python interpreter spins up and heavy libraries import. The binary starts almost instantly, loads the weights, and then you see tokens streaming to your console.

On my local machine (a Surface with an Intel 8100Y), I get a smooth stream of tokens, generated entirely on the CPU. No heavy lifting required.

The model weights file is the same size either way, but the deployment artifact around it (your code plus its runtime) is drastically smaller in Rust than in a Python stack.

This makes Rust an ideal candidate for Edge AI, IoT devices, or Serverless functions where startup time and memory footprint are critical costs.

Does this mean Python is dead? Absolutely not.

Python remains the best ecosystem for training, experimenting, and data science. However, when it's time to take that model and ship it to production—especially in resource-constrained environments—Rust is a superpower.

By using frameworks like Candle, we can run modern LLMs like Phi-3 on standard CPUs with incredible efficiency. We get the safety and speed of Rust without sacrificing the capabilities of modern AI.
