DEV Community

Khushi Nakra
Trying On-Device LLM Inference on Windows with Python

Cloud-based language models are widely used, but running models on-device can reduce latency, eliminate recurring API costs, and keep data off third-party servers.

Below is a minimal example of running a compressed large language model on a Windows machine using picoLLM.

Why Run Models On-Device?

Running models locally can:

  • keep data on the device
  • avoid network latency
  • eliminate per-request API costs

At the same time, local inference introduces challenges such as hardware constraints and model optimization. picoLLM makes it easier to run compressed open-weight models across platforms.

Setup

  1. Install Python:

    https://www.python.org/downloads/

  2. Install picoLLM:

pip install picollm

  3. Get an AccessKey and download a model from: https://console.picovoice.ai/
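Rather than hardcoding the AccessKey in source, it can be read from an environment variable. Below is a minimal sketch; the variable name PICOVOICE_ACCESS_KEY and the helper function are assumptions for illustration, not something picoLLM requires.

```python
import os

def load_access_key(env_var="PICOVOICE_ACCESS_KEY"):
    # Read the Picovoice AccessKey from the environment so it is not
    # committed to source control. Raise early with a clear message if
    # the variable is missing.
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} to your Picovoice AccessKey")
    return key
```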

picoLLM supports models such as Llama, Gemma, Mixtral, Mistral, and Phi, and runs across Windows, macOS, Linux, Raspberry Pi, mobile, and browsers.

Minimal Python Example

Import the package and initialize the engine:

import picollm

# AccessKey from https://console.picovoice.ai/ and the path to the
# downloaded .pllm model file.
access_key = "${ACCESS_KEY}"
model_path = "${MODEL_PATH}"

pllm = picollm.create(
    access_key=access_key,
    model_path=model_path
)
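One Windows-specific note on the model path: backslashes in string literals can cause escape issues, so building the path with pathlib is safer. A small sketch, using a hypothetical filename (substitute the name of the .pllm file you downloaded):

```python
from pathlib import Path

# Build the model path with pathlib so Windows backslashes are handled
# correctly. "phi2-290.pllm" is a placeholder example filename.
model_path = str(Path.home() / "Downloads" / "phi2-290.pllm")
```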

Generate a completion:

res = pllm.generate(prompt="what is the air-speed velocity of an unladen swallow?")
print(res.completion)

Streaming tokens:

res = pllm.generate(
    prompt="what is the air-speed velocity of an unladen swallow?",
    stream_callback=lambda x: print(x, flush=True, end="")
)
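If you need the streamed tokens after generation finishes, the callback can collect them as well as print them. A minimal sketch, assuming the callback receives one decoded token string per call (as in the lambda above):

```python
class TokenCollector:
    """Stream callback that prints each token and keeps a copy."""

    def __init__(self):
        self.tokens = []

    def __call__(self, token):
        self.tokens.append(token)
        print(token, end="", flush=True)
```

Pass an instance as stream_callback, then join collector.tokens afterwards.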

Release the engine when finished:

pllm.release()
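Tying the steps together, it is worth releasing the engine even when generation raises. A sketch of that flow; the engine_factory parameter is a hypothetical indirection so the logic can be exercised without the real library (in practice you would pass picollm.create):

```python
def run_completion(engine_factory, access_key, model_path, prompt):
    # Create the engine, generate a completion, and always release the
    # engine's resources, even if generate() raises.
    pllm = engine_factory(access_key=access_key, model_path=model_path)
    try:
        return pllm.generate(prompt=prompt).completion
    finally:
        pllm.release()
```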

Node.js Example

The same idea in Node.js:

const { PicoLLM } = require("@picovoice/picollm-node");

// accessKey and modelPath are defined as in the Python example above.
// await is only valid inside an async function in CommonJS, so the calls
// are wrapped in an async IIFE.
(async () => {
  const pllm = new PicoLLM(accessKey, modelPath);

  const res = await pllm.generate(
    "what is the air-speed velocity of an unladen swallow?",
    {
      streamCallback: (token) => process.stdout.write(token)
    }
  );

  pllm.release();
})();

Additional Resources

For a full step-by-step walkthrough and detailed explanation, see the original guide:

https://picovoice.ai/blog/how-to-run-a-local-llm-on-windows/
