Cloud-based language models are widely used, but running models on-device can help reduce latency, recurring API costs, and data privacy concerns.
Below is a minimal example of running a compressed large language model on a Windows machine using picoLLM.
Why Run Models On-Device?
Running models locally can:
- keep data on the device
- avoid network latency
At the same time, local inference introduces challenges such as hardware constraints and model optimization. picoLLM makes it easier to run compressed open-weight models across platforms.
Setup
- Install Python: https://www.python.org/downloads/
- Install picoLLM: pip install picollm
- Get an AccessKey and download a model from https://console.picovoice.ai/
picoLLM supports models such as Llama, Gemma, Mixtral, Mistral, and Phi, and runs across Windows, macOS, Linux, Raspberry Pi, mobile, and browsers.
Minimal Python Example
Import the package and initialize the engine:
```python
import picollm

pllm = picollm.create(
    access_key,  # your Picovoice AccessKey
    model_path   # path to the downloaded model file
)
```
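The `access_key` and `model_path` arguments above are placeholders. One way to supply them is via environment variables; this is a sketch of my own, not something picoLLM mandates, and the variable names below are illustrative:

```python
import os

def load_picollm_config():
    # Environment variable names are this sketch's choice, not part of the picoLLM API.
    access_key = os.environ.get("PICOVOICE_ACCESS_KEY")
    model_path = os.environ.get("PICOLLM_MODEL_PATH")
    if not access_key or not model_path:
        raise RuntimeError(
            "Set PICOVOICE_ACCESS_KEY and PICOLLM_MODEL_PATH before creating the engine"
        )
    return access_key, model_path

# Then: access_key, model_path = load_picollm_config()
#       pllm = picollm.create(access_key, model_path)
```

Keeping the AccessKey out of source code also avoids accidentally committing it.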
Generate a completion:
```python
res = pllm.generate(prompt="what is the air-speed velocity of an unladen swallow?")
print(res.completion)
```
Streaming tokens:
```python
res = pllm.generate(
    prompt="what is the air-speed velocity of an unladen swallow?",
    stream_callback=lambda x: print(x, flush=True, end="")
)
```
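To see how a stream callback behaves, here is a stub that mimics per-token invocation. `fake_generate` and its token list are made up for illustration; the real engine calls your callback with each token as it is generated:

```python
def fake_generate(prompt, stream_callback=None):
    # Stand-in for pllm.generate: invokes the callback once per token,
    # then returns the full completion text.
    tokens = ["It ", "depends ", "on ", "the ", "swallow."]
    for token in tokens:
        if stream_callback is not None:
            stream_callback(token)
    return "".join(tokens)

pieces = []
completion = fake_generate("unladen swallow?", stream_callback=pieces.append)
assert completion == "".join(pieces)  # the callback saw every token, in order
```

Collecting tokens into a list like this is also a handy way to buffer streamed output for a UI instead of printing directly.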
Release the engine when finished:
```python
pllm.release()
```
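To guarantee release() runs even when generation raises, the engine can be wrapped in a context manager. This is a generic Python pattern, not a picoLLM feature; the stub engine below stands in for the real one:

```python
from contextlib import contextmanager

@contextmanager
def managed_engine(create_fn, *args, **kwargs):
    # Works with any factory returning an object that exposes release(),
    # e.g. picollm.create(access_key, model_path).
    engine = create_fn(*args, **kwargs)
    try:
        yield engine
    finally:
        engine.release()

# Demonstrate with a stub (stands in for the real picoLLM engine):
class StubEngine:
    def __init__(self):
        self.released = False
    def release(self):
        self.released = True

with managed_engine(StubEngine) as eng:
    pass
assert eng.released  # release() ran when the block exited
```

With the real engine this would read `with managed_engine(picollm.create, access_key, model_path) as pllm:`.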
Node.js Example
The same idea in Node.js:
```javascript
const { PicoLLM } = require("@picovoice/picollm-node");

// CommonJS modules don't allow top-level await, so wrap the calls in an async function.
(async () => {
  const pllm = new PicoLLM(accessKey, modelPath);
  const res = await pllm.generate(
    "what is the air-speed velocity of an unladen swallow?",
    {
      streamCallback: (token) => process.stdout.write(token),
    }
  );
  pllm.release();
})();
```
Additional Resources
- Python API docs: https://picovoice.ai/docs/api/picollm-python/
- Node.js API docs: https://picovoice.ai/docs/api/picollm-nodejs/
- Demo repository: https://github.com/Picovoice/picollm/tree/main/demo
For a full step-by-step walkthrough and detailed explanation, see the original guide:
https://picovoice.ai/blog/how-to-run-a-local-llm-on-windows/