Running local LLMs is easy. Running them well in a real application is not.
You end up with fragile inference scripts, no idea which model fits which task, manual VRAM calculations, and zero observability into what's actually happening. I got tired of it, so I built LocalForge.
What it is
LocalForge is a self-hosted AI control plane. It exposes a single OpenAI-compatible endpoint and handles everything else — model lifecycle, intelligent routing, memory, and finetuning.
import openai

# Your app stays the same. Just change base_url.
client = openai.OpenAI(base_url="http://localhost:8010/v1", api_key="lf-xxx")
response = client.chat.completions.create(model="auto", messages=[...])
How the router works
When you send model: "auto", the routing engine:
- Classifies the query — TF-IDF + Logistic Regression, under 5ms, into coding / math / reasoning / instruction / general
- Scores each model using:
- Benchmark scores from HuggingFace (MMLU-Pro, HumanEval, GSM8K) — 40%
- Vector memory of past query→outcome pairs stored in Qdrant — 30%
- Measured latency on your hardware — 15%
- Thumbs up/down feedback — 15%
- Falls back to cloud (OpenAI/Gemini) if confidence < 0.3
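The scoring step above can be sketched as a weighted sum over normalized signals. The weights and the 0.3 fallback threshold come from the list; everything else (function names, signal dict, the cloud model name) is illustrative, not LocalForge's actual internals.

```python
# Illustrative sketch of multi-signal routing. Only the weights and the
# 0.3 confidence threshold come from the post; names are assumptions.
WEIGHTS = {"benchmark": 0.40, "memory": 0.30, "latency": 0.15, "feedback": 0.15}
CONFIDENCE_THRESHOLD = 0.3

def score_model(signals: dict) -> float:
    """Weighted sum of per-signal scores, each normalized to [0, 1]."""
    return sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())

def route(candidates: dict, cloud_model: str = "cloud-fallback") -> str:
    """Pick the best-scoring local model, or fall back to cloud when even
    the best candidate scores below the confidence threshold."""
    best = max(candidates, key=lambda m: score_model(candidates[m]))
    if score_model(candidates[best]) < CONFIDENCE_THRESHOLD:
        return cloud_model
    return best
```

A model with signals `{"benchmark": 0.8, "memory": 0.6, "latency": 0.9, "feedback": 0.7}` scores 0.74 and is served locally; if every candidate scores under 0.3, the request goes to the cloud fallback instead.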
The memory layer uses nomic-embed-text-v1.5 to embed every query locally. Similar past queries are retrieved at routing time, and scores decay exponentially (λ = 0.95) so fresh failures hurt more than old ones.
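The decay is plain exponential weighting. Only λ = 0.95 comes from the post; the age unit (routing events here, rather than days) is my assumption.

```python
DECAY_LAMBDA = 0.95  # λ from the post

def decayed_score(outcome_score: float, age: int) -> float:
    """Weight a past query→outcome score by its age. With λ = 0.95 each
    step discounts the score by ~5%. Age unit is an assumption."""
    return outcome_score * DECAY_LAMBDA ** age
```

So an outcome one step old counts at 0.95 of a fresh one, while one fifty steps back contributes under 0.08 — which is what makes recent failures dominate the memory signal.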
VRAM lifecycle
Consumer GPUs can only hold 1–2 models at a time. LocalForge manages atomic state transitions:
UNLOADED → LOADING → HOT → UNLOADING → UNLOADED
Requests queue during model swaps. The "Resident Model" (most-used in the past 24h) is prioritized to stay loaded.
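The lifecycle above can be sketched as a small guarded state machine: only the listed edges are legal, and anything else is rejected. This is a sketch, not LocalForge's code — the real manager would also hold a lock and queue requests during LOADING/UNLOADING.

```python
from enum import Enum, auto

class ModelState(Enum):
    UNLOADED = auto()
    LOADING = auto()
    HOT = auto()
    UNLOADING = auto()

# Legal edges from the lifecycle above; anything else is rejected, so a
# model can never, e.g., jump straight from HOT back to LOADING.
TRANSITIONS = {
    ModelState.UNLOADED: {ModelState.LOADING},
    ModelState.LOADING: {ModelState.HOT},
    ModelState.HOT: {ModelState.UNLOADING},
    ModelState.UNLOADING: {ModelState.UNLOADED},
}

def transition(current: ModelState, target: ModelState) -> ModelState:
    """Validate and apply a state change (illustrative sketch)."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Making every edge explicit is what keeps the swaps atomic: a crashed load can't leave a model half-resident in an undefined state.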
Finetuning pipeline
- Upload CSV or JSONL dataset via the dashboard
- Pick base model + hyperparameters
- Training runs in an isolated subprocess via Unsloth (2× faster, 60% less VRAM)
- Live loss curves stream to the browser via SSE
- On completion: LoRA adapters merged → GGUF exported → model auto-registered in the router
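The live loss curve boils down to formatting each training step as a Server-Sent Events message — a `data:` line followed by a blank line, per the EventSource protocol. The field names here are my choice; the real endpoint would presumably yield these from a FastAPI streaming response.

```python
import json

def sse_loss_event(step: int, loss: float) -> str:
    """Format one training-loss sample as an SSE message: a 'data:' line
    terminated by a blank line, as the EventSource protocol requires."""
    return f"data: {json.dumps({'step': step, 'loss': loss})}\n\n"
```

On the browser side, `new EventSource(url).onmessage` receives each event and `JSON.parse(event.data)` feeds the chart.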
Tech stack
| Layer | Tech |
|---|---|
| Backend | FastAPI + aiosqlite (WAL) |
| Frontend | Next.js 16 + React 19 |
| Inference | llama-cpp-python |
| Vector store | Qdrant (disk, no Docker) |
| Embeddings | nomic-embed-text-v1.5 |
| Finetuning | Unsloth / PEFT + TRL |
| Classifier | scikit-learn TF-IDF + LogReg |
GitHub: al1-nasir/LocalForge · Self-hosted AI control plane for intelligent local LLM orchestration. OpenAI-compatible API · ML-powered multi-model routing · LoRA finetuning · vector memory · RAG
Built by Ali Nasir — alinasir.me · LinkedIn
Would love feedback on the routing architecture in particular!