inferbench: download, launch & benchmark local LLM engines from one desktop app

#llm #localllm #benchmark #ai

If you run LLMs locally, you've probably bounced between several tools: one to download a model, another to launch the engine, a third to figure out how many tokens/sec you're actually getting on your GPU. inferbench collapses that into a single desktop app.

What it does

Download models and inference engines (llama.cpp & friends) from one place.
Launch an engine against a model with the right flags, no terminal archaeology.
Benchmark real throughput on your hardware — actual tok/s, not marketing numbers. No simulated data: if an engine isn't available, you get an error, not a guess.
Serve & expose over MCP — keep a model resident and expose it to any MCP client over stdio or HTTP. Works for text and image models (Stable Diffusion via sd.cpp).
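
To make the tok/s figure concrete, here is a minimal sketch of how end-to-end throughput can be measured from a streamed generation. It is not inferbench's actual code: `measure_tokens_per_second` and `EngineUnavailableError` are hypothetical names, and the token stream stands in for whatever a real engine yields. It follows the same "error, not a guess" rule: if no tokens arrive, it raises instead of returning a number.

```python
import time
from typing import Callable, Iterable


class EngineUnavailableError(RuntimeError):
    """Hypothetical error: raised when there is no real output to measure."""


def measure_tokens_per_second(
    token_stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return end-to-end throughput (tokens per second) for one generation.

    The timer starts before the first token is requested, so the result
    includes prompt processing time, not just decode speed.
    """
    start = clock()
    count = 0
    for _ in token_stream:  # consume tokens as the engine streams them
        count += 1
    elapsed = clock() - start

    if count == 0 or elapsed <= 0:
        # Never invent a number when nothing was actually measured.
        raise EngineUnavailableError("engine produced no tokens; nothing to measure")
    return count / elapsed
```

The `clock` parameter is only there so the function can be tested with a fake timer; in normal use you would pass just the token iterator.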

Why local-first

No cloud, no API keys, no per-token bill, no data leaving your machine. You see exactly what your own GPU can do — useful when you're picking a model for a real workload and need honest numbers.

In a recent smoke test, Qwen2.5-7B hit ~75 tok/s on an RTX 3070 end-to-end through inferbench.
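
For a feel of what that rate means in practice, the arithmetic below converts tok/s into per-token latency and reply time. It is a rough sketch: the 300-token reply length is an illustrative assumption, and the estimate treats the rate as constant, ignoring prompt processing and warm-up.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Average time per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def seconds_for_reply(num_tokens: int, tokens_per_second: float) -> float:
    """Estimated generation time for a reply of num_tokens at a steady rate."""
    return num_tokens / tokens_per_second


print(round(ms_per_token(75), 1))   # 13.3 ms per token
print(seconds_for_reply(300, 75))   # 4.0 seconds for a 300-token reply
```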

Stack

React + Vite + Electron on the front end, Python 3.11 + FastAPI + SQLModel on the back end, packaged with a PyInstaller sidecar. The model catalog (124 models) is cross-checked against Hugging Face.

Try it

https://github.com/JoniMartin27/inferbench

v0.1.1 is out now. Feedback and issues welcome — especially benchmark numbers from hardware I don't have. 🖥️
