Running local LLMs is easy. Running them well in a real application is not.
You end up with fragile inference scripts, no idea which model fits which task, manual VRAM calculations, and zero observability into what's actually happening. I got tired of it, so I built LocalForge.
What it is
LocalForge is a self-hosted AI control plane. It exposes a single OpenAI-compatible endpoint and handles everything else — model lifecycle, intelligent routing, memory, and finetuning.
import openai

# Your app stays the same. Just change base_url.
client = openai.OpenAI(base_url="http://localhost:8010/v1", api_key="lf-xxx")
response = client.chat.completions.create(model="auto", messages=[...])
How the router works
When you send model: "auto", the routing engine:
- Classifies the query — TF-IDF + Logistic Regression, under 5ms, into coding / math / reasoning / instruction / general
- Scores each model using:
- Benchmark scores from HuggingFace (MMLU-Pro, HumanEval, GSM8K) — 40%
- Vector memory of past query→outcome pairs stored in Qdrant — 30%
- Measured latency on your hardware — 15%
- Thumbs up/down feedback — 15%
- Falls back to cloud (OpenAI/Gemini) if confidence < 0.3
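The scoring step above can be sketched as a weighted sum over normalized signals. The weights and the 0.3 fallback threshold come from the list; everything else (function names, signal dict, the cloud model name) is illustrative, not LocalForge's actual internals.

```python
# Illustrative sketch of multi-signal routing. Only the weights and the
# 0.3 confidence threshold come from the post; names are assumptions.
WEIGHTS = {"benchmark": 0.40, "memory": 0.30, "latency": 0.15, "feedback": 0.15}
CONFIDENCE_THRESHOLD = 0.3

def score_model(signals: dict) -> float:
    """Weighted sum of per-signal scores, each normalized to [0, 1]."""
    return sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())

def route(candidates: dict, cloud_model: str = "cloud-fallback") -> str:
    """Pick the best-scoring local model, or fall back to cloud when even
    the best candidate scores below the confidence threshold."""
    best = max(candidates, key=lambda m: score_model(candidates[m]))
    if score_model(candidates[best]) < CONFIDENCE_THRESHOLD:
        return cloud_model
    return best
```

A model with signals `{"benchmark": 0.8, "memory": 0.6, "latency": 0.9, "feedback": 0.7}` scores 0.74 and is served locally; if every candidate scores under 0.3, the request goes to the cloud fallback instead.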
The memory layer uses nomic-embed-text-v1.5 to embed every query locally. Similar past queries are retrieved at routing time, and scores decay exponentially (λ = 0.95) so fresh failures hurt more than old ones.
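The decay is plain exponential weighting. Only λ = 0.95 comes from the post; the age unit (routing events here, rather than days) is my assumption.

```python
DECAY_LAMBDA = 0.95  # λ from the post

def decayed_score(outcome_score: float, age: int) -> float:
    """Weight a past query→outcome score by its age. With λ = 0.95 each
    step discounts the score by ~5%. Age unit is an assumption."""
    return outcome_score * DECAY_LAMBDA ** age
```

So an outcome one step old counts at 0.95 of a fresh one, while one fifty steps back contributes under 0.08 — which is what makes recent failures dominate the memory signal.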
VRAM lifecycle
Consumer GPUs can only hold 1–2 models at a time. LocalForge manages atomic state transitions:
UNLOADED → LOADING → HOT → UNLOADING → UNLOADED
Requests queue during model swaps. The "Resident Model" (most-used in the past 24h) is prioritized to stay loaded.
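The lifecycle above can be sketched as a small guarded state machine: only the listed edges are legal, and anything else is rejected. This is a sketch, not LocalForge's code — the real manager would also hold a lock and queue requests during LOADING/UNLOADING.

```python
from enum import Enum, auto

class ModelState(Enum):
    UNLOADED = auto()
    LOADING = auto()
    HOT = auto()
    UNLOADING = auto()

# Legal edges from the lifecycle above; anything else is rejected, so a
# model can never, e.g., jump straight from HOT back to LOADING.
TRANSITIONS = {
    ModelState.UNLOADED: {ModelState.LOADING},
    ModelState.LOADING: {ModelState.HOT},
    ModelState.HOT: {ModelState.UNLOADING},
    ModelState.UNLOADING: {ModelState.UNLOADED},
}

def transition(current: ModelState, target: ModelState) -> ModelState:
    """Validate and apply a state change (illustrative sketch)."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Making every edge explicit is what keeps the swaps atomic: a crashed load can't leave a model half-resident in an undefined state.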
Finetuning pipeline
- Upload CSV or JSONL dataset via the dashboard
- Pick base model + hyperparameters
- Training runs in an isolated subprocess via Unsloth (2× faster, 60% less VRAM)
- Live loss curves stream to the browser via SSE
- On completion: LoRA adapters merged → GGUF exported → model auto-registered in the router
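The live loss curve boils down to formatting each training step as a Server-Sent Events message — a `data:` line followed by a blank line, per the EventSource protocol. The field names here are my choice; the real endpoint would presumably yield these from a FastAPI streaming response.

```python
import json

def sse_loss_event(step: int, loss: float) -> str:
    """Format one training-loss sample as an SSE message: a 'data:' line
    terminated by a blank line, as the EventSource protocol requires."""
    return f"data: {json.dumps({'step': step, 'loss': loss})}\n\n"
```

On the browser side, `new EventSource(url).onmessage` receives each event and `JSON.parse(event.data)` feeds the chart.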
Tech stack
| Layer | Tech |
|---|---|
| Backend | FastAPI + aiosqlite (WAL) |
| Frontend | Next.js 16 + React 19 |
| Inference | llama-cpp-python |
| Vector store | Qdrant (disk, no Docker) |
| Embeddings | nomic-embed-text-v1.5 |
| Finetuning | Unsloth / PEFT + TRL |
| Classifier | scikit-learn TF-IDF + LogReg |
GitHub: al1-nasir/LocalForge · Self-hosted AI control plane for intelligent local LLM orchestration. OpenAI-compatible API · ML-powered multi-model routing · LoRA finetuning · vector memory · RAG
Built by Ali Nasir — alinasir.me · LinkedIn
Would love feedback on the routing architecture in particular!