Cloud-based language models are widely used, but running models on-device can help reduce latency, recurring API costs, and data privacy concerns.
Below is a minimal example of running a compressed large language model on a Windows machine using picoLLM.
Why Run Models On-Device?
Running models locally can:
- keep data on the device
- avoid network latency
At the same time, local inference introduces challenges such as hardware constraints and model optimization. picoLLM makes it easier to run compressed open-weight models across platforms.
Setup
- Install Python: https://www.python.org/downloads/
- Install picoLLM: pip install picollm
- Get an AccessKey and download a model from https://console.picovoice.ai/
picoLLM supports models such as Llama, Gemma, Mixtral, Mistral, and Phi, and runs across Windows, macOS, Linux, Raspberry Pi, mobile, and browsers.
Minimal Python Example
Import the package and initialize the engine:
```python
import picollm

pllm = picollm.create(
    access_key,  # your Picovoice AccessKey
    model_path   # path to the downloaded model file
)
```
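The `access_key` and `model_path` arguments above are placeholders. One way to supply them is via environment variables; this is a sketch of my own, not something picoLLM mandates, and the variable names below are illustrative:

```python
import os

def load_picollm_config():
    # Environment variable names are this sketch's choice, not part of the picoLLM API.
    access_key = os.environ.get("PICOVOICE_ACCESS_KEY")
    model_path = os.environ.get("PICOLLM_MODEL_PATH")
    if not access_key or not model_path:
        raise RuntimeError(
            "Set PICOVOICE_ACCESS_KEY and PICOLLM_MODEL_PATH before creating the engine"
        )
    return access_key, model_path

# Then: access_key, model_path = load_picollm_config()
#       pllm = picollm.create(access_key, model_path)
```

Keeping the AccessKey out of source code also avoids accidentally committing it.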
Generate a completion:
```python
res = pllm.generate(prompt="what is the air-speed velocity of an unladen swallow?")
print(res.completion)
```
Streaming tokens:
```python
res = pllm.generate(
    prompt="what is the air-speed velocity of an unladen swallow?",
    stream_callback=lambda x: print(x, flush=True, end="")
)
```
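To see how a stream callback behaves, here is a stub that mimics per-token invocation. `fake_generate` and its token list are made up for illustration; the real engine calls your callback with each token as it is generated:

```python
def fake_generate(prompt, stream_callback=None):
    # Stand-in for pllm.generate: invokes the callback once per token,
    # then returns the full completion text.
    tokens = ["It ", "depends ", "on ", "the ", "swallow."]
    for token in tokens:
        if stream_callback is not None:
            stream_callback(token)
    return "".join(tokens)

pieces = []
completion = fake_generate("unladen swallow?", stream_callback=pieces.append)
assert completion == "".join(pieces)  # the callback saw every token, in order
```

Collecting tokens into a list like this is also a handy way to buffer streamed output for a UI instead of printing directly.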
Release the engine when finished:
```python
pllm.release()
```
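To guarantee release() runs even when generation raises, the engine can be wrapped in a context manager. This is a generic Python pattern, not a picoLLM feature; the stub engine below stands in for the real one:

```python
from contextlib import contextmanager

@contextmanager
def managed_engine(create_fn, *args, **kwargs):
    # Works with any factory returning an object that exposes release(),
    # e.g. picollm.create(access_key, model_path).
    engine = create_fn(*args, **kwargs)
    try:
        yield engine
    finally:
        engine.release()

# Demonstrate with a stub (stands in for the real picoLLM engine):
class StubEngine:
    def __init__(self):
        self.released = False
    def release(self):
        self.released = True

with managed_engine(StubEngine) as eng:
    pass
assert eng.released  # release() ran when the block exited
```

With the real engine this would read `with managed_engine(picollm.create, access_key, model_path) as pllm:`.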
Node.js Example
The same idea in Node.js:
```javascript
const { PicoLLM } = require("@picovoice/picollm-node");

// CommonJS modules don't allow top-level await, so wrap the calls in an async function.
(async () => {
  const pllm = new PicoLLM(accessKey, modelPath);
  const res = await pllm.generate(
    "what is the air-speed velocity of an unladen swallow?",
    {
      streamCallback: (token) => process.stdout.write(token),
    }
  );
  pllm.release();
})();
```
Additional Resources
- Python API docs: https://picovoice.ai/docs/api/picollm-python/
- Node.js API docs: https://picovoice.ai/docs/api/picollm-nodejs/
- Demo repository: https://github.com/Picovoice/picollm/tree/main/demo
For a full step-by-step walkthrough and detailed explanation, see the original guide:
https://picovoice.ai/blog/how-to-run-a-local-llm-on-windows/