<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: deharoalexandre-cyber</title>
    <description>The latest articles on DEV Community by deharoalexandre-cyber (@elynecorp).</description>
    <link>https://dev.to/elynecorp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3866678%2Fcec5970c-7b54-4aad-893e-499700476542.png</url>
      <title>DEV Community: deharoalexandre-cyber</title>
      <link>https://dev.to/elynecorp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elynecorp"/>
    <language>en</language>
    <item>
      <title>I built an Ollama alternative with TurboQuant, model groups, and multi-GPU support</title>
      <dc:creator>deharoalexandre-cyber</dc:creator>
      <pubDate>Wed, 08 Apr 2026 00:34:18 +0000</pubDate>
      <link>https://dev.to/elynecorp/i-built-an-ollama-alternative-with-turboquant-model-groups-and-multi-gpu-support-555p</link>
      <guid>https://dev.to/elynecorp/i-built-an-ollama-alternative-with-turboquant-model-groups-and-multi-gpu-support-555p</guid>
      <description>&lt;h2&gt;The problem&lt;/h2&gt;

&lt;p&gt;I run multi-model architectures — 3 LLMs receiving the same prompt, deliberating, and producing a consensus response. Think of it as a voting system where individual model biases cancel out.&lt;/p&gt;
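The deliberation layer itself lives above the inference server, but the voting idea is easy to picture. A minimal sketch (illustrative only: the real consensus logic is not part of EIE, and reducing free-text answers to comparable labels is hand-waved here):

```python
from collections import Counter

def consensus(responses):
    """Majority vote over normalized model responses.

    responses: list of (model_name, text) pairs. Returns the answer
    given by the most models, plus how many of them agreed.
    """
    normalized = [text.strip().lower() for _, text in responses]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes

answer, votes = consensus([
    ("mistral-7b", "Benign"),
    ("granite-3b", "benign"),
    ("exaone-2.4b", "Malicious"),
])
# 2 of 3 models agree on "benign"; one dissenting bias is outvoted
```

The point of running three models is exactly this step: a single model's systematic bias becomes a minority vote.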

&lt;p&gt;Ollama swaps models sequentially. vLLM is cloud-oriented. llama.cpp server handles one model at a time. None of them could do what I needed: load 3+ models simultaneously, send them the same prompt in parallel, collect all responses, and handle failures gracefully.&lt;/p&gt;

&lt;p&gt;So I built EIE.&lt;/p&gt;

&lt;h2&gt;What EIE does&lt;/h2&gt;

&lt;p&gt;EIE (Elyne Inference Engine) is a local inference server for GGUF models. It loads models, serves them via an OpenAI-compatible REST API, and manages GPU memory.&lt;/p&gt;

&lt;p&gt;It does &lt;strong&gt;one thing&lt;/strong&gt;: serve completions. No agents, no RAG, no UI. Everything else runs on top.&lt;/p&gt;

&lt;h3&gt;Model Groups&lt;/h3&gt;

&lt;p&gt;This is the core idea. Instead of thinking in individual models, EIE thinks in &lt;strong&gt;groups&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;core&lt;/span&gt;
    &lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;mistral-7b&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;granite-3b&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;exaone-2.4b&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;required_responses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;parallel&lt;/span&gt;
    &lt;span class="na"&gt;pinned&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;fallback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;partial&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three execution patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallel&lt;/strong&gt; — same prompt to N models simultaneously, all responses returned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential&lt;/strong&gt; — output of model A becomes input of model B (vision → language pipelines)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fan-out&lt;/strong&gt; — same prompt to N models, best response selected
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Execute a group&lt;/span&gt;
curl http://localhost:8080/v1/batch/execute &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "group": "core",
    "messages": [{"role": "user", "content": "Analyze this alert"}]
  }'&lt;/span&gt;

&lt;span class="c"&gt;# Returns all 3 responses with latency and status&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
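A batch response carries one entry per model, so downstream code mostly filters and ranks. A sketch against an assumed response shape — the `status`/`latency_ms`/`text` field names here are my illustration of "all responses with latency and status", not a documented contract:

```python
def successful(results):
    """Responses that actually completed."""
    return [r for r in results if r.get("status") == "ok"]

def fastest(results):
    """Lowest-latency successful response, or None if nothing completed."""
    ok = successful(results)
    return min(ok, key=lambda r: r["latency_ms"]) if ok else None

batch = [  # shape assumed, for illustration
    {"model": "mistral-7b",  "status": "ok", "latency_ms": 412, "text": "..."},
    {"model": "granite-3b",  "status": "ok", "latency_ms": 275, "text": "..."},
    {"model": "exaone-2.4b", "status": "timeout"},
]
best = fastest(batch)  # granite-3b: fastest of the two that finished
```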



&lt;h3&gt;Policy Engine&lt;/h3&gt;

&lt;p&gt;Scheduling behavior is not hardcoded — it's driven by pluggable strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;generic&lt;/strong&gt; — on-demand loading, LRU eviction. Ollama replacement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pinned-group&lt;/strong&gt; — N models permanently loaded, multi-response required. Multi-model deliberation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;multi-group&lt;/strong&gt; — multiple pinned groups, each with its own rules. Dual-core architectures (2×3 LLMs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fixed-appliance&lt;/strong&gt; — pre-loaded at boot, no dynamic loading. Edge devices.&lt;/li&gt;
&lt;/ul&gt;
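The generic strategy is the familiar Ollama-style behavior. A toy sketch of on-demand loading with LRU eviction (illustrative Python, not EIE's actual C++ implementation; the handle string stands in for a real model load):

```python
from collections import OrderedDict

class LRULoader:
    """On-demand model loading with least-recently-used eviction."""
    def __init__(self, max_loaded):
        self.max_loaded = max_loaded
        self.loaded = OrderedDict()  # name -> handle, oldest first

    def acquire(self, name):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
        else:
            while len(self.loaded) >= self.max_loaded:
                self.loaded.popitem(last=False)  # evict the LRU model
            self.loaded[name] = f"handle:{name}"  # stand-in for a real load
        return self.loaded[name]

pool = LRULoader(max_loaded=2)
pool.acquire("mistral-7b")
pool.acquire("granite-3b")
pool.acquire("mistral-7b")   # touch: granite-3b is now least recently used
pool.acquire("exaone-2.4b")  # evicts granite-3b
```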

&lt;p&gt;Custom strategies can be loaded from shared libraries without recompiling:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;plugin:libmystrategy.so&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Fallback strategies&lt;/h3&gt;

&lt;p&gt;If one model in a group fails or times out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;strict&lt;/strong&gt; — entire request fails (default)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;partial&lt;/strong&gt; — return what completed, flag as incomplete&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;retry_once&lt;/strong&gt; — retry the failed model, then fall back to partial&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;replace_with&lt;/strong&gt; — swap in a backup model and continue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is critical for production. A single slow model shouldn't kill your entire pipeline.&lt;/p&gt;
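The four strategies reduce to a small decision function. A hypothetical sketch of how a scheduler might apply them; EIE's real fallback handling is in C++ and richer than this, and `backup` here must simply be another entry in the model table:

```python
def run_with_fallback(group, models, prompt, strategy, backup=None):
    """Run each model in the group; resolve failures per fallback strategy.

    group: list of model names to query.
    models: dict name -> callable(prompt); a call may raise on failure.
    backup: name of a stand-in model (a key of `models`) for replace_with.
    Returns (successful_responses, complete_flag).
    """
    def attempt(name):
        try:
            return {"model": name, "text": models[name](prompt), "status": "ok"}
        except Exception:
            return {"model": name, "status": "failed"}

    results = [attempt(name) for name in group]
    if strategy == "strict":
        failed = [r["model"] for r in results if r["status"] != "ok"]
        if failed:
            raise RuntimeError(f"models failed: {failed}")
    elif strategy == "retry_once":
        results = [attempt(r["model"]) if r["status"] != "ok" else r
                   for r in results]
    elif strategy == "replace_with" and backup is not None:
        results = [attempt(backup) if r["status"] != "ok" else r
                   for r in results]
    ok = [r for r in results if r["status"] == "ok"]
    return ok, len(ok) == len(group)

models = {"good": lambda p: "ok: " + p,
          "flaky": lambda p: 1 / 0,  # always fails, for the demo
          "spare": lambda p: "spare: " + p}

responses, complete = run_with_fallback(
    ["good", "flaky"], models, "Analyze this alert", strategy="partial")
# partial: the one response that finished comes back, flagged incomplete
```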

&lt;h3&gt;TurboQuant native&lt;/h3&gt;

&lt;p&gt;TurboQuant (Google Research, ICLR 2026) compresses the KV cache to 3 bits per value using Walsh-Hadamard transforms + Lloyd-Max quantization. ~5× compression with minimal quality loss.&lt;/p&gt;

&lt;p&gt;EIE supports it as a first-class option:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;f16&lt;/strong&gt; — no compression, debug/baseline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;q8_0&lt;/strong&gt; — ~2× compression, sensitive models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;turbo4&lt;/strong&gt; — ~4× compression, quality &amp;gt; compression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;turbo3&lt;/strong&gt; — ~5× compression, production default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;turbo2&lt;/strong&gt; — ~6.4× compression, extreme memory pressure&lt;/li&gt;
&lt;/ul&gt;
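The headline ratios follow directly from bits per value. A worked example for a Mistral-7B-shaped model (32 layers, 8 KV heads, head dim 128 under grouped-query attention); TurboQuant's codebook and transform overheads are ignored here, so treat the absolute numbers as approximate:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Size of the K+V cache: two tensors per layer, one value per
    (position, kv head, head-dim channel), at the given width."""
    values = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return values * bits_per_value / 8

MIB = 1024 ** 2
f16    = kv_cache_bytes(32, 8, 128, 4096, 16)  # uncompressed baseline
turbo3 = kv_cache_bytes(32, 8, 128, 4096, 3)   # 3 bits per value

print(f16 / MIB, turbo3 / MIB, f16 / turbo3)
# 512 MiB at f16 vs 96 MiB at turbo3, a 16/3 ≈ 5.3x reduction per model
```

Multiply that saving by three pinned models and the group-level VRAM numbers later in the post start to make sense.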

&lt;p&gt;The interesting part: &lt;strong&gt;adaptive KV&lt;/strong&gt;. If the health check detects a model under memory pressure (latency spike), the Policy Engine can downgrade turbo3 → turbo2 &lt;strong&gt;at runtime without reloading the model&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;inference&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kv_cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;  &lt;span class="c1"&gt;# picks best format based on available VRAM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
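The adaptive part can be pictured as a small control loop: watch a latency signal and, when it crosses a threshold, step one format toward heavier compression. A hypothetical sketch; the p95 signal, the threshold, and the one-way downgrade are my simplifications, not EIE's actual policy:

```python
KV_MODES = ["f16", "q8_0", "turbo4", "turbo3", "turbo2"]  # most -> least memory

def next_kv_mode(current, p95_latency_ms, threshold_ms):
    """Step down one KV format when latency signals memory pressure."""
    if p95_latency_ms <= threshold_ms:
        return current  # healthy model keeps its mode
    i = KV_MODES.index(current)
    return KV_MODES[min(i + 1, len(KV_MODES) - 1)]  # clamp at turbo2

mode = next_kv_mode("turbo3", p95_latency_ms=1800, threshold_ms=1200)
# pressure detected: turbo3 steps down to turbo2
```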



&lt;h3&gt;VRAM Quality of Service&lt;/h3&gt;

&lt;p&gt;Explicit memory management with per-group budgets:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;vram&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;reserve_mb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;512&lt;/span&gt;
  &lt;span class="na"&gt;low_watermark&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;85&lt;/span&gt;    &lt;span class="c1"&gt;# start evicting non-pinned models&lt;/span&gt;
  &lt;span class="na"&gt;critical_watermark&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;95&lt;/span&gt;  &lt;span class="c1"&gt;# force eviction&lt;/span&gt;
  &lt;span class="na"&gt;group_isolation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
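The watermark logic is simple enough to state in a few lines. A sketch, assuming each model record carries its VRAM cost, pin status, and last-use order; the real VRAM Manager is C++ and also enforces per-group budgets, which this omits:

```python
def plan_evictions(models, total_mb, low_watermark_pct):
    """Pick models to evict (LRU-first, never pinned) until VRAM usage
    drops below the low watermark."""
    used = sum(m["vram_mb"] for m in models)
    limit = total_mb * low_watermark_pct / 100
    victims = []
    for m in sorted(models, key=lambda m: m["last_used"]):  # oldest first
        if used <= limit:
            break
        if m["pinned"]:
            continue  # pinned groups are never evicted
        victims.append(m["name"])
        used -= m["vram_mb"]
    return victims

victims = plan_evictions(
    [{"name": "mistral-7b",  "vram_mb": 4200,  "pinned": True,  "last_used": 1},
     {"name": "scratch-13b", "vram_mb": 10000, "pinned": False, "last_used": 2},
     {"name": "embed-0.5b",  "vram_mb": 600,   "pinned": False, "last_used": 3}],
    total_mb=16000, low_watermark_pct=85)
# only scratch-13b goes: the pinned model is skipped, and evicting the
# large non-pinned model already brings usage under the 85% watermark
```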



&lt;h3&gt;CUDA + ROCm from the same codebase&lt;/h3&gt;

&lt;p&gt;One build flag changes the GPU backend:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON  &lt;span class="c"&gt;# NVIDIA&lt;/span&gt;
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_HIP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON   &lt;span class="c"&gt;# AMD&lt;/span&gt;
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build                 &lt;span class="c"&gt;# CPU fallback&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend is auto-detected at runtime; everything above the backend layer is GPU-agnostic.&lt;/p&gt;

&lt;p&gt;AMD ROCm is a &lt;strong&gt;first-class target&lt;/strong&gt;, not an afterthought. For appliance deployments, an AMD Radeon PRO W7900 (48 GB) at a fraction of the cost of an A100 makes multi-model serving very practical.&lt;/p&gt;

&lt;h2&gt;VRAM budget examples&lt;/h2&gt;

&lt;p&gt;With TurboQuant turbo3, Q4_K_M weights, 4096 context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3-model group&lt;/strong&gt; on a 16 GB GPU (RTX 4080 class) → ~7.7 GB used, 8.3 GB free&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6-model dual-core&lt;/strong&gt; on AMD W7900 48 GB → ~16 GB used, 32 GB free&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6 LLMs + vision&lt;/strong&gt; on AMD W7900 48 GB → ~18 GB used, 30 GB free&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without TurboQuant, the 3-model setup would need ~9.2 GB — the difference between fitting comfortably and running tight.&lt;/p&gt;

&lt;h2&gt;Architecture&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Clients (any HTTP client)
       |
  [API Layer]
  Layer 1: OpenAI-compatible (drop-in)
  Layer 2: Generic extensions (/v1/batch/execute, /v1/chain/execute)
       |
  [Policy Engine] ← YAML config + hot-reload
       |
  [Group Scheduler]
  Parallel | Sequential | Fan-out
  Fallback: strict | partial | retry | replace
  Health-check → adaptive KV downgrade
       |
  [Model Manager + VRAM Manager]
       |
  [Inference Workers]
       |
  [ComputeBackend]
  CudaBackend | HipBackend | CpuBackend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;~1,300 lines of C++17. Based on llama.cpp (TurboQuant fork).&lt;/p&gt;

&lt;h2&gt;How it compares&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; — no scheduling, no groups, no TurboQuant, sequential model swap only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; — cloud-oriented, no TurboQuant, no policy engine, no model groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp server&lt;/strong&gt; — single model, no scheduling, no VRAM QoS, no fallback&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Getting started&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/deharoalexandre-cyber/EIE.git
&lt;span class="nb"&gt;cd &lt;/span&gt;EIE &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git submodule update &lt;span class="nt"&gt;--init&lt;/span&gt;
./scripts/build-cuda.sh
./build/eie-server &lt;span class="nt"&gt;--config&lt;/span&gt; presets/generic.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standard OpenAI API on &lt;code&gt;localhost:8080&lt;/code&gt;. Any existing client works without modification.&lt;/p&gt;
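Because Layer 1 speaks the standard chat-completions schema, plain `urllib` is enough, with no SDK required. A sketch (the model name and port come from the examples above; the commented-out call obviously needs a running server):

```python
import json
import urllib.request

def chat_completion_request(model, content, base="http://localhost:8080"):
    """Build a standard /v1/chat/completions request for the local server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }).encode()
    return urllib.request.Request(
        f"{base}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )

req = chat_completion_request("mistral-7b", "Hello")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library should work the same way by pointing its base URL at `localhost:8080/v1`.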

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Wire in the real llama.cpp inference loop (placeholders are in &lt;code&gt;cpu_backend.cpp&lt;/code&gt; with all integration points marked)&lt;/li&gt;
&lt;li&gt;Validate TurboQuant on AMD ROCm&lt;/li&gt;
&lt;li&gt;JSON request parsing for the API routes&lt;/li&gt;
&lt;li&gt;Community scheduling strategies in &lt;code&gt;contrib/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/deharoalexandre-cyber/EIE" rel="noopener noreferrer"&gt;github.com/deharoalexandre-cyber/EIE&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: Apache 2.0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preprint&lt;/strong&gt;: &lt;a href="https://doi.org/10.5281/zenodo.19439972" rel="noopener noreferrer"&gt;https://doi.org/10.5281/zenodo.19439972&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feedback welcome — especially from anyone running multi-model setups or working with TurboQuant on ROCm. What scheduling strategies would be useful to you?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>cpp</category>
    </item>
  </channel>
</rss>
