Lightning Developer

Posted on Jun 12

Finding the Right Local LLM for Your PC: A Practical Guide with whichllm

#ai #webdev #rag #pinggy

The Challenge of Picking a Local AI Model

The open-source AI ecosystem is expanding at an incredible pace. New language models appear almost every week, each promising better reasoning, coding, or conversational abilities. At the same time, hardware requirements vary significantly between models, making the selection process more complicated than simply downloading the most popular option.

A model that runs smoothly on one machine may struggle on another. Available VRAM, system memory, quantization settings, and processor capabilities all influence the experience. As a result, many users spend hours comparing benchmarks, reading community recommendations, and testing multiple models before finding one that works well.

This is exactly the problem that whichllm aims to solve.

What Is whichllm?

whichllm is an open-source command-line utility that recommends local language models based on the hardware available on your system. Instead of relying on generic rankings or parameter counts, it analyzes your machine and suggests models that balance quality, speed, and resource requirements.

The tool automatically evaluates:

GPU specifications
Available VRAM
CPU resources
System memory
Storage capacity

Using this information, it generates recommendations that are practical for your specific setup rather than theoretical best-case scenarios.

Why Model Selection Is Often Difficult

Running a local language model typically involves three separate decisions.

First, you need to determine which models can realistically fit within your hardware limits.

Second, among the models that fit, you must identify which one actually performs best.

Finally, you need to deploy and run the chosen model.

Most modern tools focus on deployment. Determining compatibility and evaluating quality usually requires manual research. Benchmark sites, discussion forums, and community recommendations can help, but they often become outdated quickly as new models are released.

whichllm simplifies the first two stages by combining hardware analysis with benchmark-based recommendations.

How whichllm Evaluates Models

The recommendation system goes beyond simple hardware matching. It incorporates data from multiple benchmark sources to create a broader assessment of model quality.

The evaluation process considers information from sources such as:

LiveBench
Artificial Analysis Index
Chatbot Arena rankings
Open LLM Leaderboard
Coding-focused benchmarks

Rather than relying on a single score, whichllm combines multiple signals into a composite ranking. Newer models are also given appropriate consideration so that outdated benchmark results do not dominate recommendations.

Another important factor is quantization. A model that barely fits in memory using an aggressive quantization level may provide a worse experience than a slightly smaller model running at a higher-quality quantization. The ranking system attempts to account for these tradeoffs.

Installing whichllm

One of the easiest ways to run the tool is through uv:

uvx whichllm@latest

For a permanent installation:

uv tool install whichllm

Alternatively:

pip install whichllm

Once installed, launch it with:

whichllm

The tool will inspect the available hardware and generate a list of recommended models.

A typical output includes hardware information along with model rankings, estimated throughput, quantization recommendations, and overall quality scores.

Useful Commands

whichllm includes several commands beyond basic recommendations.

Simulate recommendations for another GPU:

whichllm --gpu "RTX 5090"

Compare upgrade possibilities:

whichllm upgrade "RTX 4090" "RTX 5090"

Launch a recommended model directly:

whichllm run "qwen3.6"

Generate structured JSON output:

whichllm --json

Users planning future hardware purchases can also estimate what components are required for specific models.

Running the Recommended Model with Ollama

After selecting a model, the next step is deployment. Ollama provides one of the simplest ways to run quantized models locally.

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Download a recommended model:

ollama pull qwen3:27b-q5_k_m

Start the inference server:

ollama serve

The API becomes available locally on port 11434.

To verify that the server is running correctly:

curl http://localhost:11434/api/tags

You can also generate a quick response using:

curl http://localhost:11434/api/generate -d '{
  "model":"qwen3:27b-q5_k_m",
  "prompt":"Explain quantization in simple terms.",
  "stream":false
}'

Accessing a Local LLM from Anywhere

By default, Ollama listens only on localhost, which means other devices cannot access the model directly.

This can become limiting when:

Collaborating with team members
Testing applications from mobile devices
Connecting cloud applications to local AI infrastructure
Demonstrating projects remotely

One approach is creating a secure tunnel that exposes the local API through a public HTTPS endpoint.

For example:

ssh -p 443 -R0:localhost:11434 -t qr@free.pinggy.io "u:Host:localhost:11434"

After the tunnel is established, requests sent to the generated URL are forwarded to the local Ollama instance.

This allows OpenAI-compatible clients, development tools, and applications to communicate with the locally hosted model without deploying it to a cloud server.

Adding Authentication

Security becomes important once an endpoint is publicly accessible.

A token can be added during tunnel creation:

ssh -p 443 -R0:localhost:11434 -t qr@free.pinggy.io "u:Host:localhost:11434" "k:mysecrettoken"

Requests must then include:

Authorization: Bearer mysecrettoken

This prevents unauthorized usage and helps protect local computing resources.

Limitations to Keep in Mind

Although whichllm significantly simplifies model selection, it is not intended to be a universal AI recommendation engine.

A few limitations include:

Private models are not evaluated.
Very recent releases may not have enough benchmark data.
Throughput estimates remain approximations.
Specialized audio and embedding models are outside its primary focus.

These limitations stem largely from available benchmark information rather than the tool itself.

Conclusion

Choosing a local language model has traditionally involved a combination of guesswork, experimentation, and community recommendations. While that process can be educational, it is often time-consuming and inconsistent.

whichllm introduces a more structured approach by evaluating actual hardware capabilities and combining them with benchmark-driven rankings. Instead of wondering which model might work best, users receive recommendations tailored to the resources available on their machines.

Combined with Ollama for deployment, the workflow becomes remarkably simple: analyze hardware, select a model, download it, and start serving it locally. For anyone exploring self-hosted AI, that removes much of the friction that previously stood between curiosity and practical experimentation.

Reference

whichllm: One Command to Find the Best Local LLM for Your Hardware

whichllm auto-detects your GPU, CPU, and RAM, then ranks local models by real benchmarks rather than parameter count. Here's how to use it, run your pick with Ollama, and share it remotely via Pinggy.

pinggy.io