Authors: Dhanush
Date: 2025‑05‑22
1. Overview
Provide an OpenAI‑compatible REST/WS endpoint backed by self‑hosted LLMs that supports:
• Low‑latency inference
• Hot-reloadable LoRA adapters for continual fine-tuning
• Optional RAG retrieval
• Multi‑tenant data isolation
2. Goals / Non-Goals
Goals | Non-Goals |
---|---|
Drop-in replacement for chat/completions, embeddings, fine-tunes | Training giant base models from scratch |
Sub-second P90 latency for ≤4k context (7B–13B params) | Supporting 70B+ models in v1 |
Fine-tune on new data with ≤15 min turnaround, hot-swap without downtime | Human RLHF pipeline |
RAG over customer docs (S3 / SharePoint / Git) | Automatic doc-chunking heuristics |
3. Background / References
- Explosion of local-LLM serving projects (vLLM, Ollama, OpenLLM, …)
- Data-residency and PII-control requirements prohibit the use of external APIs
- Continual-learning vs. RAG comparison
- Template sources: Google Eng Design Doc, UCL HLD template, OpenAI-compatible server specs
4. High‑Level Design
Component Diagram
```
[Client SDK]──HTTPS──▶[API Gateway & Auth]
                │
                ▼
 ┌──▶[OpenAI-Shim Service (FastAPI)]◄───┐
 │               │                      │
 │               ▼                      │
 │      [Inference Cluster]             │
 │       • vLLM workers                 │
 │       • GPU autoscale                │
 │               │                      │
 │               ├──▶[Vector DB (pgvector/Qdrant)] (RAG)
 │               │
 │               └──▶[LoRA Adapter FS + Hot-Reload]
 │
 └──▶[Async Job Orchestrator (Airflow/K8s-Cron)]
                 │
                 ├──▶[Fine-Tune Trainer (Q-LoRA)]
                 │            │
                 │            └──▶[Feature Store / Data Lake]
                 │
                 └──▶[Model Registry (MLflow)]
```
5. Low‑Level Design
API Spec Snippet
```http
POST /v1/chat/completions
Authorization: Bearer <token>
OpenAI-Tenant: acme

{
  "model": "acme/gemma-7b-lora-v12",
  "messages": [
    { "role": "user", "content": "Explain RAG." }
  ],
  "rag": { "enabled": true, "collection": "acme_docs" }
}
```
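A minimal sketch of how the shim could enforce the `OpenAI-Tenant` header before forwarding to vLLM's OpenAI-compatible server. The `VLLM_URL` address and the tenant-prefix ownership check are illustrative assumptions, not a fixed design:

```python
# Sketch of the shim endpoint (FastAPI). VLLM_URL and the tenant-ownership
# check are illustrative assumptions, not a fixed design.
import httpx
from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()
VLLM_URL = "http://vllm-workers:8000/v1/chat/completions"  # assumed internal address

@app.post("/v1/chat/completions")
async def chat_completions(
    request: Request,
    authorization: str = Header(...),
    tenant: str = Header(..., alias="OpenAI-Tenant"),
):
    body = await request.json()
    # Tenant isolation: only models under the tenant's namespace are served.
    if not str(body.get("model", "")).startswith(f"{tenant}/"):
        raise HTTPException(status_code=403, detail="model not owned by tenant")
    body.pop("rag", None)  # custom extension; resolved by the shim, not vLLM
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(VLLM_URL, json=body,
                                 headers={"Authorization": authorization})
    return resp.json()
```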
Storage Layout
```
chat-logs/{tenant}/{date}/{file}.ndjson
adapters/{tenant}/{model}/{version}/adapter.safetensors
rag/{tenant}/chunks/{uuid}.parquet
```
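So every component derives tenant-scoped keys the same way, the layout above could live in a single helper module; a sketch, with function names made up for illustration:

```python
# Sketch: single source of truth for the tenant-scoped storage layout above.
# Function names are illustrative, not an existing module.
from datetime import date
from uuid import UUID

def chat_log_key(tenant: str, day: date, filename: str) -> str:
    return f"chat-logs/{tenant}/{day.isoformat()}/{filename}.ndjson"

def adapter_key(tenant: str, model: str, version: str) -> str:
    return f"adapters/{tenant}/{model}/{version}/adapter.safetensors"

def rag_chunk_key(tenant: str, chunk_id: UUID) -> str:
    return f"rag/{tenant}/chunks/{chunk_id}.parquet"
```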
6. Sequence Flow (Chat Request)
```
Client  -> Gateway : POST /chat/completions
Gateway -> Shim    : validate & forward
Shim    -> vLLM    : generate tokens
vLLM    -> Shim    : stream back
Shim    -> Gateway : wrap as SSE
Gateway -> Client  : deliver chunked response
```
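The "wrap as SSE" step might look like the following: a stripped-down variant of the shim endpoint that asks vLLM for a streamed response and relays each `data:` line as a server-sent event. The upstream URL is an assumed internal address:

```python
# Sketch of relaying vLLM's streamed chunks to the client as SSE.
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()
VLLM_URL = "http://vllm-workers:8000/v1/chat/completions"  # assumed address

async def relay(body: dict):
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", VLLM_URL,
                                 json={**body, "stream": True}) as upstream:
            async for line in upstream.aiter_lines():
                if line:  # vLLM already emits OpenAI-style "data: ..." lines
                    yield line + "\n\n"

@app.post("/v1/chat/completions")
async def stream_chat(request: Request):
    return StreamingResponse(relay(await request.json()),
                             media_type="text/event-stream")
```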
7. Observability
- Tracing: OpenTelemetry spans from Gateway → Shim → vLLM
- Metrics: tokens/s, GPU utilization, P99 latency, adapter-reload success (export sketch after this list)
- Logs: structured JSON; redaction for sensitive content
- Eval harness: nightly regression on MMLU subset + toxicity tests
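A sketch of how the metrics above might be exported with prometheus_client; the metric names and label sets are assumptions:

```python
# Sketch: exporting the metrics named above via prometheus_client.
# Metric names and labels are illustrative assumptions.
from prometheus_client import Counter, Histogram, start_http_server

TOKENS = Counter("llm_tokens_generated_total",
                 "Tokens generated", ["tenant", "model"])
LATENCY = Histogram("llm_request_latency_seconds",
                    "End-to-end request latency", ["tenant"])
RELOADS = Counter("lora_adapter_reloads_total",
                  "Adapter reload attempts", ["tenant", "status"])

start_http_server(9100)  # expose /metrics for Prometheus to scrape

# Usage inside the shim's request handler:
with LATENCY.labels(tenant="acme").time():
    pass  # forward the request, stream tokens back
TOKENS.labels(tenant="acme", model="gemma-7b-lora-v12").inc(128)
RELOADS.labels(tenant="acme", status="success").inc()
```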
8. Data Flow
- Chat Logs → PII scrubber → S3
- Weekly fine-tune via DAG → feature store → adapter training
- RAG pipeline fetches user docs → chunks → embeddings → vector DB (ingestion sketch after this list)
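A sketch of that last ingestion step using sentence-transformers and the Qdrant client; the embedding model and the per-tenant collection naming are assumptions:

```python
# Sketch: chunk embeddings upserted into a tenant-scoped Qdrant collection.
# Assumes the collection already exists with a matching vector size.
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
qdrant = QdrantClient(url="http://qdrant:6333")

def ingest(tenant: str, chunks: list[str]) -> None:
    vectors = encoder.encode(chunks).tolist()
    points = [
        PointStruct(id=str(uuid.uuid4()), vector=vec, payload={"text": chunk})
        for vec, chunk in zip(vectors, chunks)
    ]
    qdrant.upsert(collection_name=f"{tenant}_docs", points=points)
```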
9. Security / Multi-Tenancy
- JWT + tenant ID enforced on every layer
- Data stored in isolated S3 prefixes, KMS-encrypted
- Full deletion of RAG index + adapters on tenant account delete (offboarding sketch below)
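The deletion guarantee could be implemented roughly as follows; the bucket name and Qdrant URL are assumptions, and the prefixes follow the storage layout in section 5:

```python
# Sketch: tenant offboarding that removes S3 prefixes and the RAG collection.
# Bucket name and Qdrant URL are illustrative assumptions.
import boto3
from qdrant_client import QdrantClient

def delete_tenant(tenant: str, bucket: str = "llm-platform-data") -> None:
    objects = boto3.resource("s3").Bucket(bucket).objects
    for prefix in (f"chat-logs/{tenant}/",
                   f"adapters/{tenant}/",
                   f"rag/{tenant}/"):
        objects.filter(Prefix=prefix).delete()  # batched delete of the prefix
    QdrantClient(url="http://qdrant:6333").delete_collection(f"{tenant}_docs")
```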
10. Scalability & HA
- Stateless vLLM workers behind GPU autoscaler
- Gateway & shim deployed as 3-replica HA sets
- Redis + Vector DB (Qdrant/pgvector) support horizontal sharding
11. Key Trade-offs
Decision | Rationale | Cons |
---|---|---|
vLLM vs. llama.cpp | GPU performance, OpenAI-compatible API out of the box | Requires NVIDIA (or ROCm) GPUs |
LoRA adapters hot-swap | No downtime, small files | Accumulating adapters per tenant (disk) |
pgvector inside Postgres | One stack, ACID deletes | Lower recall vs. Milvus |
12. Alternatives Considered
- Fully serverless (Lambda + Fireworks) – couldn’t meet residency or latency goals
- Prefix-tuning at inference – added latency
- LangChain VectorRouter – didn’t support tenant-aware auth
13. Failure Modes & Mitigations
Failure | Detection | Mitigation |
---|---|---|
GPU OOM / restart | Prometheus alert | K8s restarts pod; retry in queue |
Model update crash | Adapter reload fails | Fall back to previous adapter (sketch after this table) |
Abuse or prompt injection | Spike in toxic tokens | API key ban / logging / moderation filters |
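One way the adapter fallback could work: remember the last-known-good version per tenant and revert if loading the new one fails. `load_lora` is a stand-in for whatever adapter-load call the inference runtime exposes:

```python
# Sketch: hot-swap with fallback to the last-known-good adapter version.
ACTIVE_ADAPTERS: dict[str, str] = {}  # tenant -> currently loaded version

def load_lora(path: str) -> None:
    """Placeholder for the runtime's adapter-load call; raises on bad files."""
    ...

def reload_adapter(tenant: str, model: str, new_version: str) -> str | None:
    previous = ACTIVE_ADAPTERS.get(tenant)
    try:
        load_lora(f"adapters/{tenant}/{model}/{new_version}/adapter.safetensors")
        ACTIVE_ADAPTERS[tenant] = new_version
        return new_version
    except Exception:
        if previous is not None:  # revert rather than serve a broken adapter
            load_lora(f"adapters/{tenant}/{model}/{previous}/adapter.safetensors")
        return previous  # stay on last-known-good
```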
14. Milestones
- +2w MVP inference
- +6w fine‑tune pipeline
- +10w RAG GA
- +12w SRE hand‑over
🔎 Where It Fits in the Market
🎯 1. Data-Sensitive Enterprises
Your value prop: “You get ChatGPT-like power, but no data ever leaves your infra.”
Who are they?
- Banks, insurance firms, governments, healthcare, defense
- AI-inclined startups building privacy-sensitive SaaS
- Fortune 500s building internal copilots (e.g., compliance bots, legal assistants)
Why your platform fits:
- They can't send data to OpenAI or even Hugging Face hosted endpoints
- They want OpenAI-compatible APIs to plug into existing apps
- They need PII-compliant fine-tuning or RAG over internal documents
- They want retrainable models without shipping data off-prem
💼 2. Private LLM-as-a-Service for Mid-Market & Vertical SaaS
“Give every vertical SaaS company a secure, updatable AI brain.”
Think:
- Legal tech (custom contracts AI)
- Medical record copilots
- ERP / CRM with internal LLM plugins
- Recruiting platforms with resume-screening
These companies:
- Don’t have ML teams
- Can’t build infra like vLLM + LoRA adapters + vector search
- Want “OpenAI, but safe and cheap and updatable”
You can give them:
- White-label endpoints
- Tenant-specific fine-tuning from their own data
- Built-in RAG without MLOps plumbing
🧠 3. AI Infra Vendors Falling Short
There’s a clear gap:
- OpenLLM, LocalAI, vLLM – great OSS, no plug-and-play platform
- AWS Bedrock, Anyscale – powerful, but expensive and not easy to extend
- Fireworks, Predibase – fine-tune-focused, RAG is bolt-on
You can position your platform as:
- Open-source + Enterprise-ready
- Bring your model / Bring your GPU
- Multi-tenant, compliant, traceable
💸 Where You Monetize (Business Models)
1. Infra Platform-as-a-Service
Sell access to managed, OpenAI-style endpoints where each customer gets:
- Their own inference runtime (LLaMA/Mistral)
- Their own LoRA adapter store
- Their own opt-in fine-tune pipeline
- Optional private RAG
Ideal customer: SaaS companies, research firms, compliance-heavy industries
Pricing: Monthly fee + token usage + fine-tune cost
2. Enterprise On-Prem Deployment
Package the whole stack as a Helm chart / Docker bundle + support contract.
You earn by:
- Installation, onboarding
- Annual support & update licensing
- Training the org’s model on their data (LoRA-as-a-service)
3. White-Label Kit
Let startups integrate your stack but rebrand the UI & endpoints.
You earn by:
- Monthly license per tenant
- Pay-as-you-fine-tune
- Usage-based inference
🔥 Why Now Is the Time
Trend | How You Ride It |
---|---|
💡 Companies want OpenAI UX without data risk | You offer it via a compatible interface that runs anywhere |
🧠 People love “ChatGPT on my docs” | You bundle RAG + fine-tune in a turnkey API |
💥 OSS models are exploding (LLaMA 3, Mistral, Gemma) | You wrap them with infra they lack |
🛑 Privacy regulation (GDPR, HIPAA, SOC2) tightening | You make private LLMs compliant and traceable |
🧱 AI infra is still painful to set up | You offer it pre-packaged, with monitoring and reloading baked in |
🧩 Your Differentiator
Most of the market today is:
- Stateless: no learning from API traffic
- Cloud-bound: privacy concerns
- Monolithic: no per-tenant LoRA
- Dev-heavy: you need to wire vector stores, retrievers, adapters yourself
You propose:
- OpenAI-style APIs for self-improving, tenant-isolated, auditable LLMs
- With just enough modularity: inference, training, and RAG that plug and play
✅ Market Entry Strategy
- Dev-first audience: Launch with GitHub + playground UI (like Ollama, but with built-in fine-tuning).
- Vertical SaaS partnerships: Offer white-label endpoints to 2–3 industries (legal, health, finance).
- Go upmarket: Build SOC2-ready “Private LLM Gateway” for enterprise IT.
- Eventually: Become the “Snowflake for AI Inference” – usage metered, infra abstracted, but fully tenant-owned data.
Appendix – Glossary
- LoRA – Low-Rank Adaptation
- RAG – Retrieval-Augmented Generation
- vLLM – High-throughput LLM inference engine with paged KV-cache management
- MLflow – Model registry and experiment tracker