I built a self-hosted LLM stack that grades itself — audit trail, per-user auth, and a built-in acceptance test

#ai #llm #selfhosted #devops

canonical_url: https://dev.to/elvisyao007/REPLACE-AFTER-PUBLISH

Repo: https://github.com/elvisyao007/onprem-llm-stack (Apache-2.0)
Runs fully on-prem. No data — including the audit log — leaves the box.

Most "deploy your own ChatGPT" tutorials stop at the moment the container answers a question in a browser. That's the easy 20%. The hard 80% is everything an enterprise actually asks before it puts the thing in front of users: Who can call it? What did they ask? And how do I know it's good enough to ship — objectively, not by vibes?

The reason this matters isn't theoretical. Across 2026 enterprise surveys, roughly 88% of AI pilots never reach production, and the most-cited blocker isn't model quality — it's the absence of an evaluation/acceptance bar and the governance around access and audit. A demo that runs is not a production signal.

So I built a stack where the demo isn't the deliverable. The deliverable is three things a tutorial skips:

Data never leaves the box — including the audit log.
Per-user access control with attributable audit — you can answer "who tried to call a model they weren't allowed to."
A built-in acceptance test — one command, and the stack grades itself with an independent judge and gives you a PASS/FAIL.

The boring part (compose + a gateway + a web UI) is the part everyone already has. This post is about the three parts they don't, and the three bugs I only found because I actually tried to run them.

The shape of it

Nothing exotic in the wiring:

Inference: Ollama in the dev profile (the host already runs it), vLLM in the prod profile. The gateway hides which one is behind it.
Gateway: LiteLLM — one place for keys, budgets, model routing, and audit callbacks.
UI: Open WebUI.
Two compose profiles, every image pinned to an exact version (no latest — air-gapped reproducibility is a precondition, not a nice-to-have).

The interesting design choice is what sits at the center: the evaluation methodology is the backbone; the retriever, the model, the framework are all swappable payload. Everything in the stack is config except the question "is this good enough," which is the one thing you can't outsource to a model version bump.

Bug #1: the access check that exists but never runs

The first real feature is per-user virtual keys: alice may call qwen3-32b, bob may call gemma4-31b, and crossing that line should return a 403.

LiteLLM (v1.88.1) ships a function called can_key_call_model. It does exactly what the name says. The problem: on the custom-auth path, it's never invoked from common_checks. So with a custom authenticator wired in, a key authorized for any model could call every model. The guardrail was in the codebase and silently bypassed.

The fix wasn't to monkey-patch the routing layer. It was to enforce access at the only point where I had both the authenticated identity and the requested model in hand: read the model out of the raw request body inside the auth hook, check it against the key's allow-list, and raise 403 before returning the auth object.

alice → qwen3-32b   → 200 OK
alice → gemma4-31b  → 403 model_access_denied
bob   → gemma4-31b  → 200 OK
bob   → qwen3-32b   → 403 model_access_denied

The lesson I'd hand to anyone wiring custom auth into a gateway: a function existing in the library is not the same as that function running on your code path. Verify the denial, don't assume the helper fires.

Bug #2: "someone broke the rules, but we don't know who"

With denials working, I checked the audit log. The successful requests were fine — user, model, token counts, latency. The denied requests were recorded as user_id='unknown'.

That's the worst possible failure for a security audit. "An unauthorized attempt happened and we can't attribute it" is exactly the line you don't want in front of an enterprise security reviewer. And it's backwards from how audit value actually works: who tried to cross a boundary is more important to log than who used the system normally.

The root cause was a sequencing problem. LiteLLM calls the failure callback after the custom auth raises — and by then the request metadata is empty, so the callback has no identity to attribute the row to.

The fix mirrors Bug #1: write the audit record at the one moment the context exists — inside the auth hook, before raising the 403 — with the correct user, key label, model, and denial reason. Then I tag the exception so the downstream failure callback sees it's already been logged and skips it, instead of writing a second unknown row.

One detail I left deliberately: genuinely invalid keys (keys that don't exist in the system at all) still log as unknown. That's honest — there's no identity to attribute. The audit distinguishes "a known user attempted something they weren't allowed to" from "an unidentifiable caller hit the door." Those are different events and the log should say so.

The actual differentiator: the stack grades itself

Here's the part no compose tutorial has. After you bring the stack up, you run:

make smoke-eval

and it runs a small, fully offline acceptance test: ~15 neutral factual questions, asked through the gateway to the model under test, then scored by a different model acting as judge — and prints a PASS/FAIL against a threshold.

Two principles, both non-negotiable:

The judge is never the generator. Default generator is qwen3-32b; default judge is gemma4-31b — different model families, so nothing grades its own homework. The summary JSON literally carries judge_independent: true, and the report states it in plain text. A self-graded eval is worth nothing; if you remember one thing from this section, make it that.

The golden set contains zero real data. It's neutral technical/general-knowledge questions. An acceptance test that ships with customer data would contradict the entire "nothing leaves the box" premise — including, especially, the test itself.

My run:

smoke-eval  →  PASS  11/15 (73.3%)   threshold 70%
generator: qwen3-32b   judge: gemma4-31b (independent)

73.3%, not 100% — and that's the point. An acceptance test that returns a perfect score on first run is a test that isn't testing anything: either the questions are trivial or the judge is lenient. Four failing questions means the bar has resolution. The number you can trust is the one that can come back red.

This is the line between a demo and a production system. The enterprise blocker isn't "can it answer" — it's "by what objective standard is it good enough to ship." The stack answers that in the first minute, on the customer's own hardware, with a judge that never phones home.

Bug #3: two big models, one 32 GB GPU

Running the eval surfaced the hardware reality. qwen3:32b needs ~29 GB; gemma4:31b needs ~19 GB. Generator + judge = 48 GB on a card that holds 32. They cannot co-reside.

The fix is a two-pass design: generate all answers first, evict the generator (keep_alive=0), then load the judge and score all answers in a second pass. The naive structure — generate one, judge one — would thrash the GPU, swapping a 20–30 GB model in and out on every single question. Batching the passes turns dozens of model loads into exactly two.

There was a second, subtler trap. Both models are "thinking" models — they spend output tokens on chain-of-thought before the answer. With a tight judge token budget, the CoT exhausts the allowance and you get empty content back, which the judge then can't score. The fix was to pass think:false through the gateway and raise the judge's token ceiling. You don't see this one unless you actually run the loop end to end on real hardware; it never shows up in a notebook.

What I deliberately left out

v0.1 ships auth, audit, and the acceptance test. It does not ship PII content guardrails, SSO/LDAP, Langfuse-style observability, Kubernetes, or multi-GPU serving. Those are on the roadmap, written down as roadmap — not quietly absent. Scope discipline is the whole game for a solo build: a small thing that actually survives enterprise reality beats a big thing that's 80% stubs.

Why this exists

The three bugs above share a shape: the capability looked present, and only running it proved whether it was real. The access check existed but didn't fire. The audit logged, but not the events that mattered. The eval would run, but thrash the GPU and feed the judge empty answers. None of these are visible from the README of the tool you're integrating — they're visible from the failing run.

That's the difference I'm trying to build into everything here: not a system that runs, but a system that survives dirty data, multiple users, data that can't leave the building, and an objective, repeatable definition of "good enough to ship."

The deployment stack is one repo. The full evaluation methodology lives in two companions:

eval-driven-llm — the eval-first reference system (frozen golden sets, pinned independent judge, deterministic retrieval metrics).
eval-sanity — a zero-dependency tool that audits whether your retrieval metric can be trusted in the first place.

Stack: onprem-llm-stack. Clone it, bring it up, run make smoke-eval, and watch it grade itself.