deharoalexandre-cyber

I built an Ollama alternative with TurboQuant, model groups, and multi-GPU support

The problem

I run multi-model architectures — 3 LLMs receiving the same prompt, deliberating, and producing a consensus response. Think of it as a voting system where individual model biases cancel out.

Ollama swaps models sequentially. vLLM is cloud-oriented. llama.cpp server handles one model at a time. None of them could do what I needed: load 3+ models simultaneously, send them the same prompt in parallel, collect all responses, and handle failures gracefully.

So I built EIE.

What EIE does

EIE (Elyne Inference Engine) is a local inference server for GGUF models. It loads models, serves them via an OpenAI-compatible REST API, and manages GPU memory.

It does one thing: serve completions. No agents, no RAG, no UI. Everything else runs on top.

Model Groups

This is the core idea. Instead of thinking in individual models, EIE thinks in groups:

groups:
  - name: core
    models: [mistral-7b, granite-3b, exaone-2.4b]
    required_responses: 3
    type: parallel
    pinned: true
    fallback: partial

Three execution patterns:

  • Parallel — same prompt to N models simultaneously, all responses returned
  • Sequential — output of model A becomes input of model B (vision → language pipelines)
  • Fan-out — same prompt to N models, best response selected
# Execute a group
curl http://localhost:8080/v1/batch/execute \
  -H "Content-Type: application/json" \
  -d '{
    "group": "core",
    "messages": [{"role": "user", "content": "Analyze this alert"}]
  }'

# Returns all 3 responses with latency and status
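
A client can consume that batch response in a few lines of plain Python. This is a sketch under assumptions: I'm guessing the response carries a `responses` array with `model`, `status`, and `content` fields per entry — the exact field names may differ in EIE.

```python
import json
import urllib.request


def execute_group(group: str, prompt: str,
                  base_url: str = "http://localhost:8080") -> list[dict]:
    """Send one prompt to every model in a group via /v1/batch/execute."""
    payload = json.dumps({
        "group": group,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/batch/execute",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["responses"]


def completed_only(responses: list[dict]) -> list[dict]:
    """Keep only responses whose model finished — useful with partial fallback."""
    return [r for r in responses if r.get("status") == "ok"]
```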

Policy Engine

Scheduling behavior is not hardcoded — it's driven by pluggable strategies:

  • generic — on-demand loading, LRU eviction. Ollama replacement.
  • pinned-group — N models permanently loaded, multi-response required. Multi-model deliberation.
  • multi-group — multiple pinned groups, each with its own rules. Dual-core architectures (2×3 LLMs).
  • fixed-appliance — pre-loaded at boot, no dynamic loading. Edge devices.
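
For instance, a dual-core setup under the multi-group strategy might look like this. The second group's model names are hypothetical, and I'm extrapolating the schema from the group config shown earlier:

```yaml
policy:
  strategy: multi-group

groups:
  - name: core-a
    models: [mistral-7b, granite-3b, exaone-2.4b]
    type: parallel
    pinned: true
    fallback: partial
  - name: core-b
    models: [qwen-3b, phi-3.5, llama-3.2-3b]  # hypothetical second core
    type: parallel
    pinned: true
    fallback: retry_once
```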

Custom strategies can be loaded from shared libraries without recompiling:

policy:
  strategy: plugin:libmystrategy.so

Fallback strategies

If one model in a group fails or times out:

  • strict — entire request fails (default)
  • partial — return what completed, flag as incomplete
  • retry_once — retry the failed model, then fall back to partial
  • replace_with — swap in a backup model and continue
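
The decision logic is simple enough to sketch. This is an illustrative re-implementation, not EIE's actual code — function and field names here are made up:

```python
def resolve(results: list[dict], policy: str, retry=None, backup=None) -> dict:
    """Apply a fallback policy to per-model results.

    Each result is {"model": str, "ok": bool, "content": str | None}.
    `retry` re-runs a failed model once; `backup` runs a replacement model.
    """
    failed = [r for r in results if not r["ok"]]
    if not failed:
        return {"status": "complete", "results": results}

    if policy == "strict":
        raise RuntimeError(f"{len(failed)} model(s) failed")

    if policy == "retry_once" and retry is not None:
        results = [r if r["ok"] else retry(r["model"]) for r in results]
        failed = [r for r in results if not r["ok"]]

    if policy == "replace_with" and backup is not None:
        results = [r if r["ok"] else backup(r["model"]) for r in results]
        failed = [r for r in results if not r["ok"]]

    status = "complete" if not failed else "partial"
    # Return whatever completed; the status flag tells the caller it's partial.
    return {"status": status, "results": [r for r in results if r["ok"]]}
```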

This is critical for production. A single slow model shouldn't kill your entire pipeline.

TurboQuant native

TurboQuant (Google Research, ICLR 2026) compresses the KV cache to 3 bits per value using Walsh-Hadamard transforms + Lloyd-Max quantization. ~5× compression with minimal quality loss.

EIE supports it as a first-class option:

  • f16 — no compression, debug/baseline
  • q8_0 — ~2× compression, sensitive models
  • turbo4 — ~4× compression, quality > compression
  • turbo3 — ~5× compression, production default
  • turbo2 — ~6.4× compression, extreme memory pressure
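
To see where the savings come from, here is back-of-envelope KV-cache math for a Mistral-7B-class model. The shape parameters (32 layers, 8 KV heads via GQA, head dim 128) are typical values, not read from EIE — check your model's metadata. Real quantized formats also carry per-block scale metadata, so the ~2×/~5× ratios in the list above are ballpark figures:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bits_per_value: float) -> float:
    """Total KV cache size: keys + values for every layer at full context."""
    values = 2 * n_layers * n_kv_heads * head_dim * ctx  # 2 = K and V
    return values * bits_per_value / 8


MIB = 1024 ** 2
for name, bits in [("f16", 16), ("q8_0", 8), ("turbo4", 4), ("turbo3", 3)]:
    size = kv_cache_bytes(32, 8, 128, 4096, bits)
    print(f"{name:7s} {size / MIB:6.0f} MiB")
# f16 comes out at 512 MiB per model; turbo3 at 96 MiB.
```

Across a 3-model group that's roughly 1.2 GB of VRAM back, which matches the scale of the budget numbers later in the post.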

The interesting part: adaptive KV. If the health-check detects a model under memory pressure (latency spike), the Policy Engine can downgrade turbo3 → turbo2 at runtime without reloading the model.

inference:
  kv_cache:
    mode: auto  # picks best format based on available VRAM

VRAM Quality of Service

Explicit memory management with per-group budgets:

vram:
  reserve_mb: 512
  low_watermark: 85    # start evicting non-pinned models
  critical_watermark: 95  # force eviction
  group_isolation: true
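
The watermark logic reduces to a small decision function. This is a sketch of how I'd read that config, not EIE's internals — in particular, whether the percentages apply before or after the reservation is my assumption:

```python
def eviction_action(used_mb: int, total_mb: int, reserve_mb: int = 512,
                    low: int = 85, critical: int = 95) -> str:
    """Map VRAM usage to an eviction action, mirroring the watermark config.

    Thresholds are percentages of the budget left after the reservation.
    """
    budget = total_mb - reserve_mb
    pct = 100 * used_mb / budget
    if pct >= critical:
        return "force-evict"        # free memory even at a latency cost
    if pct >= low:
        return "evict-non-pinned"   # drop LRU models that aren't pinned
    return "none"
```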

CUDA + ROCm from the same codebase

One build flag changes the GPU backend:

cmake -B build -DGGML_CUDA=ON  # NVIDIA
cmake -B build -DGGML_HIP=ON   # AMD
cmake -B build                 # CPU fallback

The backend is auto-detected at runtime; everything above the backend layer is GPU-agnostic.

AMD ROCm is a first-class target, not an afterthought. For appliance deployments, an AMD Radeon PRO W7900 (48 GB) at a fraction of the cost of an A100 makes multi-model serving very practical.

VRAM budget examples

With TurboQuant turbo3, Q4_K_M weights, 4096 context:

  • 3-model group on a 16 GB GPU (RTX 4080-class) → ~7.7 GB used, 8.3 GB free
  • 6-model dual-core on AMD W7900 48 GB → ~16 GB used, 32 GB free
  • 6 LLMs + vision on AMD W7900 48 GB → ~18 GB used, 30 GB free

Without TurboQuant, the 3-model setup would need ~9.2 GB — the difference between fitting comfortably and running tight.

Architecture

Clients (any HTTP client)
       |
  [API Layer]
  Layer 1: OpenAI-compatible (drop-in)
  Layer 2: Generic extensions (/v1/batch/execute, /v1/chain/execute)
       |
  [Policy Engine] ← YAML config + hot-reload
       |
  [Group Scheduler]
  Parallel | Sequential | Fan-out
  Fallback: strict | partial | retry | replace
  Health-check → adaptive KV downgrade
       |
  [Model Manager + VRAM Manager]
       |
  [Inference Workers]
       |
  [ComputeBackend]
  CudaBackend | HipBackend | CpuBackend

~1,300 lines of C++17. Based on llama.cpp (TurboQuant fork).

How it compares

  • Ollama — no scheduling, no groups, no TurboQuant, sequential model swap only
  • vLLM — cloud-oriented, no TurboQuant, no policy engine, no model groups
  • llama.cpp server — single model, no scheduling, no VRAM QoS, no fallback

Getting started

git clone https://github.com/deharoalexandre-cyber/EIE.git
cd EIE && git submodule update --init
./scripts/build-cuda.sh
./build/eie-server --config presets/generic.yaml

Standard OpenAI API on localhost:8080. Any existing client works without modification.
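
To illustrate, here's a minimal stdlib-only client against that endpoint — no SDK required. The model name is whatever you've configured in EIE; this assumes the standard OpenAI chat-completions request/response shape the post says is supported:

```python
import json
import urllib.request


def build_request(model: str, prompt: str) -> dict:
    """Standard OpenAI chat-completions payload."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}


def chat(prompt: str, model: str = "mistral-7b",
         base_url: str = "http://localhost:8080") -> str:
    """One chat completion through EIE's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Existing OpenAI SDK clients work the same way: point `base_url` at `http://localhost:8080/v1` and keep the rest of your code unchanged.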

What's next

  • Wire the real llama.cpp inference loop (placeholders are in cpu_backend.cpp with all integration points marked)
  • Validate TurboQuant on AMD ROCm
  • JSON request parsing for the API routes
  • Community scheduling strategies in contrib/

Links

Feedback welcome — especially from anyone running multi-model setups or working with TurboQuant on ROCm. What scheduling strategies would be useful to you?
