DEV Community

Muhammad Ali Nasir

LocalForge: I built a self-hosted LLM control plane with intelligent routing and LoRA finetuning

Running local LLMs is easy. Running them well in a real application is not.

You end up with fragile inference scripts, no idea which model fits which task, manual VRAM calculations, and zero observability into what's actually happening. I got tired of it, so I built LocalForge.

What it is

LocalForge is a self-hosted AI control plane. It exposes a single OpenAI-compatible endpoint and handles everything else — model lifecycle, intelligent routing, memory, and finetuning.

import openai

# Your app stays the same. Just change base_url.
client = openai.OpenAI(base_url="http://localhost:8010/v1", api_key="lf-xxx")
response = client.chat.completions.create(model="auto", messages=[...])

How the router works

When you send model: "auto", the routing engine:

  1. Classifies the query — TF-IDF + Logistic Regression, under 5ms, into coding / math / reasoning / instruction / general
  2. Scores each model using:
    • Benchmark scores from HuggingFace (MMLU-Pro, HumanEval, GSM8K) — 40%
    • Vector memory of past query→outcome pairs stored in Qdrant — 30%
    • Measured latency on your hardware — 15%
    • Thumbs up/down feedback — 15%
  3. Falls back to cloud (OpenAI/Gemini) if confidence < 0.3
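The scoring step above can be sketched as a weighted sum. This is a minimal illustration using the weights from the post; the helper names and the shape of the `candidates` input are my assumptions, not LocalForge's actual internals:

```python
# Illustrative sketch of multi-signal routing. Weights mirror the post:
# benchmarks 40%, vector memory 30%, latency 15%, feedback 15%.
WEIGHTS = {"benchmark": 0.40, "memory": 0.30, "latency": 0.15, "feedback": 0.15}
CONFIDENCE_FLOOR = 0.3  # below this, fall back to a cloud provider


def score_model(signals: dict) -> float:
    """Each signal is assumed normalized to [0, 1]; result is a weighted sum."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


def route(candidates: dict) -> str:
    """Pick the best-scoring model, or fall back when confidence is too low."""
    scored = {model: score_model(s) for model, s in candidates.items()}
    best_model, best_score = max(scored.items(), key=lambda kv: kv[1])
    if best_score < CONFIDENCE_FLOOR:
        return "cloud-fallback"  # e.g. OpenAI/Gemini
    return best_model
```

The nice property of a weighted sum with a confidence floor is that a model with strong benchmarks but a bad local track record gets pulled down by the memory and feedback signals rather than winning outright.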

The memory layer uses nomic-embed-text-v1.5 to embed every query locally. Similar past queries are retrieved at routing time, and scores decay exponentially (λ = 0.95) so fresh failures hurt more than old ones.
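The decay idea can be shown in a few lines. This is a sketch under my own assumptions (ages measured in discrete steps, neighbors retrieved as similarity/outcome/age triples); only the λ = 0.95 decay factor comes from the post:

```python
def memory_signal(neighbors: list, lam: float = 0.95) -> float:
    """Similarity-weighted average of exponentially decayed outcomes.

    neighbors: (similarity, outcome, age_steps) triples, where outcome is
    1.0 for a past success and 0.0 for a failure. Fresh entries keep their
    full weight; an entry aged n steps is discounted by lam ** n.
    """
    if not neighbors:
        return 0.5  # neutral prior when nothing similar has been seen
    weights = [sim * lam ** age for sim, _, age in neighbors]
    total = sum(weights)
    if total == 0:
        return 0.5
    return sum(w * outcome for w, (_, outcome, _) in zip(weights, neighbors)) / total
```

Because the failure at age 0 carries weight 1.0 while a success twenty steps old carries only 0.95²⁰ ≈ 0.36, a fresh failure drags the signal below neutral even against older wins.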

VRAM lifecycle

Consumer GPUs can only hold 1–2 models at a time. LocalForge manages atomic state transitions:

UNLOADED → LOADING → HOT → UNLOADING → UNLOADED

Requests queue during model swaps. The "Resident Model" (most-used in the past 24h) is prioritized to stay loaded.
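The lifecycle above is a small state machine, which can be sketched as a transition table. Names and structure here are illustrative, not LocalForge's actual code:

```python
from enum import Enum, auto


class ModelState(Enum):
    UNLOADED = auto()
    LOADING = auto()
    HOT = auto()
    UNLOADING = auto()


# Legal transitions for the UNLOADED → LOADING → HOT → UNLOADING → UNLOADED cycle.
TRANSITIONS = {
    ModelState.UNLOADED: {ModelState.LOADING},
    ModelState.LOADING: {ModelState.HOT},
    ModelState.HOT: {ModelState.UNLOADING},
    ModelState.UNLOADING: {ModelState.UNLOADED},
}


def transition(current: ModelState, target: ModelState) -> ModelState:
    """Enforce atomicity: reject any jump that skips a lifecycle stage."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Making illegal jumps raise (rather than silently proceed) is what keeps a request from hitting a model that is halfway out of VRAM.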

Finetuning pipeline

  • Upload CSV or JSONL dataset via the dashboard
  • Pick base model + hyperparameters
  • Training runs in an isolated subprocess via Unsloth (2× faster, 60% less VRAM)
  • Live loss curves stream to the browser via SSE
  • On completion: LoRA adapters merged → GGUF exported → model auto-registered in the router
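To make the "pick base model + hyperparameters" step concrete, here is an illustrative job spec. Every field name here is my assumption for illustration, not LocalForge's actual schema; only the general shape (base model, dataset, LoRA hyperparameters) follows from the pipeline described above:

```python
from dataclasses import dataclass, asdict


@dataclass
class FinetuneJob:
    """Hypothetical job spec a dashboard might hand to the training subprocess."""
    base_model: str
    dataset_path: str
    lora_r: int = 16          # LoRA rank: size of the low-rank adapter matrices
    lora_alpha: int = 32      # scaling factor applied to the adapter output
    learning_rate: float = 2e-4
    epochs: int = 3


job = FinetuneJob(base_model="llama-3.1-8b", dataset_path="data/train.jsonl")
payload = asdict(job)  # serializable dict, e.g. for a POST to the training API
```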

Tech stack

| Layer | Tech |
| --- | --- |
| Backend | FastAPI + aiosqlite (WAL) |
| Frontend | Next.js 16 + React 19 |
| Inference | llama-cpp-python |
| Vector store | Qdrant (disk, no Docker) |
| Embeddings | nomic-embed-text-v1.5 |
| Finetuning | Unsloth / PEFT + TRL |
| Classifier | scikit-learn TF-IDF + LogReg |

GitHub

al1-nasir / LocalForge

Self-hosted AI control plane for intelligent local LLM orchestration. OpenAI-compatible API · ML-powered multi-model routing · LoRA finetuning · vector memory · RAG


Python 3.10+ · Next.js 16 · FastAPI · MIT License

⚡ LocalForge

Self-Hosted AI Control Plane for Intelligent Local LLM Orchestration

A production-grade platform for running, routing, benchmarking, and finetuning local LLMs.
Drop-in OpenAI-compatible API · Intelligent multi-model routing · LoRA finetuning with live monitoring


Overview

LocalForge is a self-hosted AI control plane that turns your GPU workstation into intelligent LLM serving infrastructure. Instead of manually managing model files, writing inference scripts, and guessing which model fits which task, LocalForge automates the entire lifecycle:

  1. Browse & Download GGUF models from HuggingFace with automatic VRAM compatibility filtering
  2. Serve models via a fully OpenAI-compatible /v1/chat/completions endpoint
  3. Route queries to the optimal model using ML-powered task classification + multi-signal scoring
  4. Learn from usage patterns via a vector-based memory layer that improves routing over time
  5. Benchmark models against standard evaluations (MMLU-Pro, HumanEval, GSM8K, GPQA, MT-Bench)
  6. Finetune models with LoRA/QLoRA via a managed subprocess pipeline with live loss streaming
  7. Augment responses with a…




Built by Ali Nasir — alinasir.me · LinkedIn

Would love feedback on the routing architecture in particular!
