Jonathan Martin Paez

Originally published at github.com

inferbench: download, launch & benchmark local LLM engines from one desktop app

If you run LLMs locally, you've probably bounced between half a dozen tools: one to download a model, another to launch the engine, a third to figure out how many tokens/sec you're actually getting on your GPU. inferbench collapses that into a single desktop app.

What it does

  • Download models and inference engines (llama.cpp & friends) from one place.
  • Launch an engine against a model with the right flags, no terminal archaeology.
  • Benchmark real throughput on your hardware — actual tok/s, not marketing numbers. No simulated data: if an engine isn't available, you get an error, not a guess.
  • Serve & expose over MCP — keep a model resident and expose it to any MCP client over stdio or HTTP. Works for text and image models (Stable Diffusion via sd.cpp).
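For a feel of what "expose over MCP via stdio" means on the wire, here is a minimal client-side sketch. MCP messages are JSON-RPC 2.0, and the stdio transport sends one JSON message per line. The tool name `generate` and its `prompt` argument are hypothetical placeholders, not inferbench's documented tool names, and a real client also performs the `initialize` handshake before calling tools.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per request


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as one newline-terminated line."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # json.dumps without `indent` emits a single line; newlines inside
    # string values are escaped, so the message never spans two lines.
    return json.dumps(request) + "\n"


def parse_response(line: str) -> dict:
    """Return the `result` of a JSON-RPC response, or raise on an error reply."""
    message = json.loads(line)
    if "error" in message:
        err = message["error"]
        raise RuntimeError(f"MCP error {err.get('code')}: {err.get('message')}")
    return message["result"]


if __name__ == "__main__":
    # "generate" and "prompt" are assumed names, used for illustration only.
    print(build_tool_call("generate", {"prompt": "Hello"}), end="")
```

In practice you would write that line to the server process's stdin and read the reply from its stdout; an MCP client library handles this framing for you.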

Why local-first

No cloud, no API keys, no per-token bill, no data leaving your machine. You see exactly what your own GPU can do — useful when you're picking a model for a real workload and need honest numbers.

In a recent smoke test, Qwen2.5-7B hit ~75 tok/s on an RTX 3070 end-to-end through inferbench.
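A tok/s figure like this is just generated tokens divided by wall-clock time. The sketch below shows that arithmetic under stated assumptions: `generate` stands in for any callable that streams tokens from an engine, and `BenchmarkError` is a hypothetical name. It is not inferbench's actual implementation, only an illustration of the "error, not a guess" rule.

```python
import time
from typing import Callable, Iterable


class BenchmarkError(RuntimeError):
    """Raised instead of reporting a made-up number (hypothetical name)."""


def measure_tokens_per_second(
    generate: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time one generation run and return generated tokens per second."""
    start = clock()
    token_count = sum(1 for _ in generate())  # consume the whole token stream
    elapsed = clock() - start
    if token_count == 0 or elapsed <= 0:
        # No tokens or no measurable time: fail loudly rather than guess.
        raise BenchmarkError("no tokens produced; refusing to report throughput")
    return token_count / elapsed
```

Because the timer wraps the whole call, the result is an end-to-end number: it includes prompt processing and any startup cost inside `generate`, not just steady-state decoding. The `clock` parameter exists so the function can be tested with a fake timer.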

Stack

React + Vite + Electron on the front, Python 3.11 + FastAPI + SQLModel on the back, packaged with a PyInstaller sidecar. The model catalog (124 models) is cross-checked against Hugging Face.

Try it

https://github.com/JoniMartin27/inferbench

v0.1.1 is out now. Feedback and issues welcome — especially benchmark numbers from hardware I don't have. 🖥️
