The problem
I run multi-model architectures — 3 LLMs receiving the same prompt, deliberating, and producing a consensus response. Think of it as a voting system where individual model biases cancel out.
Ollama swaps models sequentially. vLLM is cloud-oriented. llama.cpp server handles one model at a time. None of them could do what I needed: load 3+ models simultaneously, send them the same prompt in parallel, collect all responses, and handle failures gracefully.
So I built EIE.
What EIE does
EIE (Elyne Inference Engine) is a local inference server for GGUF models. It loads models, serves them via an OpenAI-compatible REST API, and manages GPU memory.
It does one thing: serve completions. No agents, no RAG, no UI. Everything else runs on top.
Model Groups
This is the core idea. Instead of thinking in individual models, EIE thinks in groups:
```yaml
groups:
  - name: core
    models: [mistral-7b, granite-3b, exaone-2.4b]
    required_responses: 3
    type: parallel
    pinned: true
    fallback: partial
```
Three execution patterns:
- Parallel — same prompt to N models simultaneously, all responses returned
- Sequential — output of model A becomes input of model B (vision → language pipelines)
- Fan-out — same prompt to N models, best response selected
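The three patterns are easy to sketch outside the engine. This is an illustrative Python sketch of the semantics, not EIE's C++ internals; `call_model` is an invented stand-in for a real inference call, and the fan-out scoring function is a placeholder.

```python
# Sketch of the three group execution patterns (not EIE's actual code).
from concurrent.futures import ThreadPoolExecutor

def call_model(model: str, prompt: str) -> str:
    return f"{model}: reply to '{prompt}'"   # placeholder for real inference

def run_parallel(models, prompt):
    # Parallel: same prompt to N models simultaneously, all responses returned.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        return list(pool.map(lambda m: call_model(m, prompt), models))

def run_sequential(models, prompt):
    # Sequential: output of model A becomes input of model B.
    out = prompt
    for m in models:
        out = call_model(m, out)
    return out

def run_fanout(models, prompt, score=len):
    # Fan-out: same prompt to N models, "best" response selected by a scoring fn.
    return max(run_parallel(models, prompt), key=score)
```

A real scheduler also has to handle timeouts and partial failures, which is what the fallback strategies below are for.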
```bash
# Execute a group
curl http://localhost:8080/v1/batch/execute \
  -H "Content-Type: application/json" \
  -d '{
    "group": "core",
    "messages": [{"role": "user", "content": "Analyze this alert"}]
  }'
# Returns all 3 responses with latency and status
```
Policy Engine
Scheduling behavior is not hardcoded — it's driven by pluggable strategies:
- generic — on-demand loading, LRU eviction. Ollama replacement.
- pinned-group — N models permanently loaded, multi-response required. Multi-model deliberation.
- multi-group — multiple pinned groups, each with its own rules. Dual-core architectures (2×3 LLMs).
- fixed-appliance — pre-loaded at boot, no dynamic loading. Edge devices.
Custom strategies can be loaded from shared libraries without recompiling:
```yaml
policy:
  strategy: plugin:libmystrategy.so
```
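The real plugin ABI is a C++ shared library loaded at runtime, but the shape of a strategy is easy to sketch. A hypothetical Python analogue, with all names invented (EIE's actual interface may differ):

```python
# Hypothetical sketch of a scheduling-strategy interface; EIE's real
# plugin ABI is C++ loaded from a .so, and these names are invented.
from abc import ABC, abstractmethod

class SchedulingStrategy(ABC):
    @abstractmethod
    def on_request(self, group: str, loaded: list) -> list:
        """Return the models that must be loaded before serving."""

class PinnedGroup(SchedulingStrategy):
    def __init__(self, pinned):
        self.pinned = pinned

    def on_request(self, group, loaded):
        # Pinned models are always required; reload any that dropped out.
        return [m for m in self.pinned if m not in loaded]

strategy = PinnedGroup(["mistral-7b", "granite-3b", "exaone-2.4b"])
to_load = strategy.on_request("core", ["mistral-7b"])
```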
Fallback strategies
If one model in a group fails or times out:
- strict — entire request fails (default)
- partial — return what completed, flag as incomplete
- retry_once — retry the failed model, then fall back to partial
- replace_with — swap in a backup model and continue
This is critical for production. A single slow model shouldn't kill your entire pipeline.
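The four modes compose naturally: retry and replace both degrade to partial when they can't recover every model. A sketch of the semantics described above; the result shapes, function names, and `retry`/`backup` hooks are invented, not EIE's actual API.

```python
# Illustrative sketch of the fallback modes (not EIE's actual API).
def apply_fallback(results, policy, retry=None, backup=None):
    """results: dict of model -> response string, or None on failure/timeout."""
    failed = [m for m, r in results.items() if r is None]
    ok = {m: r for m, r in results.items() if r is not None}
    if not failed:
        return {"status": "complete", "responses": ok}
    if policy == "strict":
        raise RuntimeError(f"group failed: {failed}")    # entire request fails
    if policy == "retry_once" and retry is not None:
        for m in failed:
            r = retry(m)                                 # one retry per failed model
            if r is not None:
                ok[m] = r
        failed = [m for m in failed if m not in ok]
    if policy == "replace_with" and backup is not None:
        for m in failed:
            ok[f"backup:{m}"] = backup(m)                # swap in a backup model
        failed = []
    if failed:
        # partial, or retry_once after an unsuccessful retry
        return {"status": "incomplete", "responses": ok, "failed": failed}
    return {"status": "complete", "responses": ok}
```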
TurboQuant native
TurboQuant (Google Research, ICLR 2026) compresses the KV cache to 3 bits per value using Walsh-Hadamard transforms + Lloyd-Max quantization. ~5× compression with minimal quality loss.
EIE supports it as a first-class option:
- f16 — no compression, debug/baseline
- q8_0 — ~2× compression, sensitive models
- turbo4 — ~4× compression, quality > compression
- turbo3 — ~5× compression, production default
- turbo2 — ~6.4× compression, extreme memory pressure
The interesting part: adaptive KV. If the health-check detects a model under memory pressure (latency spike), the Policy Engine can downgrade turbo3 → turbo2 at runtime without reloading the model.
```yaml
inference:
  kv_cache:
    mode: auto   # picks best format based on available VRAM
```
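The adaptive step amounts to walking down a ladder of KV formats when a health signal crosses a budget. A minimal sketch of that idea; the latency threshold and function names are invented, not EIE's actual policy code.

```python
# Sketch of adaptive KV downgrade: step to the next more aggressive
# format under sustained latency spikes. Thresholds are illustrative.
KV_LADDER = ["f16", "q8_0", "turbo4", "turbo3", "turbo2"]

def next_kv_mode(current):
    i = KV_LADDER.index(current)
    return KV_LADDER[min(i + 1, len(KV_LADDER) - 1)]   # turbo2 is the floor

def maybe_downgrade(mode, p95_latency_ms, budget_ms=500):
    # Health-check hook: downgrade only when latency exceeds budget.
    return next_kv_mode(mode) if p95_latency_ms > budget_ms else mode
```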
VRAM Quality of Service
Explicit memory management with per-group budgets:
```yaml
vram:
  reserve_mb: 512
  low_watermark: 85        # start evicting non-pinned models
  critical_watermark: 95   # force eviction
  group_isolation: true
```
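The two watermarks define three regimes. A sketch of the decision logic implied by the config above (the action names are invented; only the threshold semantics come from the config):

```python
# Watermark logic from the vram config above; action names are invented.
def eviction_action(used_mb, total_mb, low=85, critical=95):
    pct = 100 * used_mb / total_mb
    if pct >= critical:
        return "force_evict"       # critical_watermark: evict regardless of pinning
    if pct >= low:
        return "evict_unpinned"    # low_watermark: start evicting non-pinned models
    return "none"
```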
CUDA + ROCm from the same codebase
One build flag changes the GPU backend:
```bash
cmake -B build -DGGML_CUDA=ON   # NVIDIA
cmake -B build -DGGML_HIP=ON    # AMD
cmake -B build                  # CPU fallback
```
Backend is auto-detected at runtime. The entire engine above the backend layer is completely GPU-agnostic.
AMD ROCm is a first-class target, not an afterthought. For appliance deployments, an AMD Radeon PRO W7900 (48 GB) at a fraction of the cost of an A100 makes multi-model serving very practical.
VRAM budget examples
With TurboQuant turbo3, Q4_K_M weights, 4096 context:
- 3-model group on a 16 GB GPU → ~7.7 GB used, 8.3 GB free
- 6-model dual-core on AMD W7900 48 GB → ~16 GB used, 32 GB free
- 6 LLMs + vision on AMD W7900 48 GB → ~18 GB used, 30 GB free
Without TurboQuant, the 3-model setup would need ~9.2 GB — the difference between fitting comfortably and running tight.
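The KV-cache side of these budgets follows from simple arithmetic. A back-of-the-envelope calculator, assuming Mistral-7B-like KV dimensions (32 layers, 8 KV heads via GQA, head_dim 128); the effective bits-per-value figures are approximations derived from the compression ratios quoted above, not exact format specs:

```python
# Back-of-the-envelope KV-cache sizing. Dimensions assume a 7B-class
# model with GQA; bits-per-value are approximate effective rates.
BITS_PER_VALUE = {"f16": 16, "q8_0": 8.5, "turbo4": 4.0, "turbo3": 3.2, "turbo2": 2.5}

def kv_cache_mib(ctx, kv_mode, n_layers=32, n_kv_heads=8, head_dim=128):
    values = 2 * n_layers * n_kv_heads * head_dim * ctx   # K and V tensors
    return values * BITS_PER_VALUE[kv_mode] / 8 / 2**20

full = kv_cache_mib(4096, "f16")         # ~512 MiB per model at 4096 context
compressed = kv_cache_mib(4096, "turbo3")  # ~102 MiB, the ~5x figure above
```

Multiply the per-model saving by 3 or 6 models and the gap between "fits comfortably" and "runs tight" is exactly the kind of margin quoted in the examples.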
Architecture
```
Clients (any HTTP client)
        |
   [API Layer]
      Layer 1: OpenAI-compatible (drop-in)
      Layer 2: Generic extensions (/v1/batch/execute, /v1/chain/execute)
        |
  [Policy Engine]  <- YAML config + hot-reload
        |
 [Group Scheduler]
      Parallel | Sequential | Fan-out
      Fallback: strict | partial | retry | replace
      Health-check -> adaptive KV downgrade
        |
[Model Manager + VRAM Manager]
        |
 [Inference Workers]
        |
  [ComputeBackend]
      CudaBackend | HipBackend | CpuBackend
```
~1,300 lines of C++17. Based on llama.cpp (TurboQuant fork).
How it compares
- Ollama — no scheduling, no groups, no TurboQuant, sequential model swap only
- vLLM — cloud-oriented, no TurboQuant, no policy engine, no model groups
- llama.cpp server — single model, no scheduling, no VRAM QoS, no fallback
Getting started
```bash
git clone https://github.com/deharoalexandre-cyber/EIE.git
cd EIE && git submodule update --init
./scripts/build-cuda.sh
./build/eie-server --config presets/generic.yaml
```
Standard OpenAI API on localhost:8080. Any existing client works without modification.
What's next
- Wire up the real llama.cpp inference loop (placeholders are in `cpu_backend.cpp` with all integration points marked)
- Validate TurboQuant on AMD ROCm
- JSON request parsing for the API routes
- Community scheduling strategies in `contrib/`
Links
- GitHub: github.com/deharoalexandre-cyber/EIE
- License: Apache 2.0
- Preprint: https://doi.org/10.5281/zenodo.19439972
Feedback welcome — especially from anyone running multi-model setups or working with TurboQuant on ROCm. What scheduling strategies would be useful to you?