<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Xavier Rey-Robert</title>
    <description>The latest articles on DEV Community by Xavier Rey-Robert (@xreyrobertibm).</description>
    <link>https://dev.to/xreyrobertibm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F142674%2Fedb78af8-8f9e-4742-aa14-ca053b456f53.jpg</url>
      <title>DEV Community: Xavier Rey-Robert</title>
      <link>https://dev.to/xreyrobertibm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xreyrobertibm"/>
    <language>en</language>
    <item>
      <title>Qwen3.6-27B + vLLM + Hermes on 24GB VRAM: May 2026 Recipe</title>
      <dc:creator>Xavier Rey-Robert</dc:creator>
      <pubDate>Fri, 19 Jun 2026 18:47:10 +0000</pubDate>
      <link>https://dev.to/xreyrobertibm/qwen36-27b-vllm-hermes-on-24gb-vram-may-2026-recipe-5452</link>
      <guid>https://dev.to/xreyrobertibm/qwen36-27b-vllm-hermes-on-24gb-vram-may-2026-recipe-5452</guid>
      <description>&lt;p&gt;If you want to reproduce my current local Hermes Agent + Qwen3.6-27B setup, this is the shape I would start from.&lt;/p&gt;

&lt;h2&gt;
  
  
  Target
&lt;/h2&gt;

&lt;p&gt;One local coding agent.&lt;br&gt;
One 24GB GPU.&lt;br&gt;
Long context.&lt;br&gt;
Tools enabled.&lt;br&gt;
Thinking enabled.&lt;/p&gt;

&lt;p&gt;No child agents fighting the main request.&lt;/p&gt;

&lt;p&gt;The goal is not peak tok/s on a short prompt. The goal is: can the same agent session keep working after hours of tool calls without losing prefix locality, timing out during prefill, or getting wrecked by auxiliary requests?&lt;/p&gt;
&lt;h2&gt;
  
  
  Model
&lt;/h2&gt;

&lt;p&gt;This setup is intentionally text-only.&lt;/p&gt;

&lt;p&gt;I am not serving the multimodal GGUF variant here. The working configuration uses &lt;code&gt;groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit&lt;/code&gt; through vLLM with &lt;code&gt;--language-model-only&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That choice matters. On a 24GB RTX 3090, the text-only GPTQ-Marlin path gave the best balance I found between long context, prefix caching, stable agent behavior and usable decode speed. Vision should be handled by a separate service/model if needed.&lt;/p&gt;
&lt;h2&gt;
  
  
  vLLM
&lt;/h2&gt;

&lt;p&gt;The useful shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 vllm serve groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--served-model-name&lt;/span&gt; qwen3.6-27b-gptq-pro-4bit &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--dtype&lt;/span&gt; float16 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--quantization&lt;/span&gt; gptq_marlin &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 131072 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--max-num-seqs&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--kv-cache-dtype&lt;/span&gt; fp8_e5m2 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--enable-prefix-caching&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--reasoning-parser&lt;/span&gt; qwen3 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--enable-auto-tool-choice&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--tool-call-parser&lt;/span&gt; qwen3_coder &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.95 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--max-cudagraph-capture-size&lt;/span&gt; 32 &lt;span class="se"&gt;\&lt;/span&gt;
   &lt;span class="nt"&gt;--language-model-only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used a recent vLLM nightly (&lt;code&gt;0.20.1rc1.dev16+g7a1eb8ac2&lt;/code&gt;), not an old stable image.&lt;/p&gt;

&lt;p&gt;The two flags people will want to argue about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--max-num-seqs 1
--max-model-len 131072
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I use &lt;code&gt;max_num_seqs=1&lt;/code&gt; deliberately. With an agent, parallelism is not free. Title generation, context compression, retries, browser checks, tool calls and side jobs can all steal KV/cache locality from the main request. On one 24GB GPU I prefer one useful request over two requests sabotaging each other.&lt;/p&gt;

&lt;p&gt;131k context is tight, but workable here. If your service OOMs, reduce context before adding MTP or enforce-eager. I would test 110k, then 100k, then 80k.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Did Not Keep
&lt;/h2&gt;

&lt;p&gt;No Qwen3 Next MTP/speculative decoding in the stable config. It caused crashes/OOMs/404s for my useful context sizes.&lt;/p&gt;

&lt;p&gt;No &lt;code&gt;--enforce-eager&lt;/code&gt;. It saves memory but degrades performance.&lt;/p&gt;

&lt;p&gt;No explicit Hermes &lt;code&gt;max_tokens: 16384&lt;/code&gt; cap. I removed it because it made debugging truncation and long reasoning/final-answer behavior harder.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hermes
&lt;/h2&gt;

&lt;p&gt;Point Hermes at the OpenAI-compatible vLLM endpoint and use the same served model name: &lt;code&gt;qwen3.6-27b-gptq-pro-4bit&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The settings that mattered for me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;context around 131072&lt;/li&gt;
&lt;li&gt;thinking enabled&lt;/li&gt;
&lt;li&gt;preserve_thinking enabled&lt;/li&gt;
&lt;li&gt;long provider/client timeout&lt;/li&gt;
&lt;li&gt;child agents disabled&lt;/li&gt;
&lt;li&gt;no hard max_tokens cap&lt;/li&gt;
&lt;li&gt;tool calls allowed, but not parallelized into chaos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The timeout matters. At large context, a real prefill can look like a dead provider if the client gives up after 180s. Use long timeouts before blaming the model.&lt;/p&gt;
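
&lt;p&gt;To make the timeout point concrete, here is a minimal Python sketch that estimates how long a cold prefill might take and how much headroom a client timeout leaves. The function names, the flat 1,000 prompt tok/s rate and the &lt;code&gt;safety_factor&lt;/code&gt; of 2 are assumptions for illustration, not measured values or Hermes settings; real prefill throughput usually varies with context length.&lt;/p&gt;

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float,
                    cached_fraction: float = 0.0) -> float:
    """Estimate seconds spent on prefill before the first token appears."""
    # Only the part of the prompt that misses the prefix cache is recomputed.
    uncached_tokens = prompt_tokens * (1.0 - cached_fraction)
    return uncached_tokens / prefill_tok_per_s


def timeout_headroom(timeout_s: float, prompt_tokens: int,
                     prefill_tok_per_s: float,
                     safety_factor: float = 2.0) -> float:
    """Seconds left after a padded, fully cold prefill; negative means too short."""
    worst_case = prefill_seconds(prompt_tokens, prefill_tok_per_s)
    return timeout_s - safety_factor * worst_case


# Illustrative inputs: a full 131072-token prompt at an assumed 1,000 prompt tok/s.
print(round(prefill_seconds(131072, 1000.0), 1))
print(round(timeout_headroom(180, 131072, 1000.0), 1))   # short client timeout
print(round(timeout_headroom(1800, 131072, 1000.0), 1))  # long client timeout
```

&lt;p&gt;Under these assumptions the 180s timeout ends up with negative headroom while 1800s leaves plenty, which is the reasoning behind the long timeouts in the config.&lt;/p&gt;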

&lt;p&gt;Here is the Hermes-specific config excerpt, redacted and trimmed to the parts that matter for the Qwen3.6/vLLM setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qwen3.6-27b-gptq-pro-4bit&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-qwen36.mylabdomain.com&lt;/span&gt;
  &lt;span class="na"&gt;context_length&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;131072&lt;/span&gt;
  &lt;span class="na"&gt;extra_body&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;chat_template_kwargs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enable_thinking&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;preserve_thinking&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;vllm-qwen36.mylabdomain.com&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vLLM Qwen3.6 27B&lt;/span&gt;
    &lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://vllm-qwen36.mylabdomain.com/v1&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;redacted&amp;gt;&lt;/span&gt;
    &lt;span class="na"&gt;default_model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qwen3.6-27b-gptq-pro-4bit&lt;/span&gt;
    &lt;span class="na"&gt;request_timeout_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1800&lt;/span&gt;
    &lt;span class="na"&gt;stale_timeout_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1800&lt;/span&gt;

&lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_turns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;240&lt;/span&gt;
  &lt;span class="na"&gt;gateway_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1800&lt;/span&gt;
  &lt;span class="na"&gt;gateway_timeout_warning&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;900&lt;/span&gt;
  &lt;span class="na"&gt;gateway_notify_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;
  &lt;span class="na"&gt;gateway_auto_continue_freshness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
  &lt;span class="na"&gt;api_max_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;reasoning_effort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;none&lt;/span&gt; &lt;span class="c1"&gt;# does not disable Qwen thinking&lt;/span&gt;
  &lt;span class="na"&gt;verbose&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;image_input_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;text&lt;/span&gt;
  &lt;span class="na"&gt;disabled_toolsets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;delegation&lt;/span&gt;

&lt;span class="na"&gt;compression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.85&lt;/span&gt;
  &lt;span class="na"&gt;target_ratio&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.2&lt;/span&gt;
  &lt;span class="na"&gt;protect_last_n&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
  &lt;span class="na"&gt;hygiene_hard_message_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;400&lt;/span&gt;

&lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;engine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;compressor&lt;/span&gt;

&lt;span class="na"&gt;delegation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_concurrent_children&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;child_timeout_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1800&lt;/span&gt;
  &lt;span class="na"&gt;max_spawn_depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;orchestrator_enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;inherit_mcp_toolsets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;default_toolsets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;terminal&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;web&lt;/span&gt;

&lt;span class="na"&gt;auxiliary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;compression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;custom&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qwen3.6-27b-gptq-pro-4bit&lt;/span&gt;
    &lt;span class="na"&gt;base_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://vllm-qwen36.mylabdomain.com/v1&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;redacted&amp;gt;&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1800&lt;/span&gt;
    &lt;span class="na"&gt;extra_body&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;chat_template_kwargs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;enable_thinking&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="na"&gt;title_generation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;custom&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qwen3.6-27b-gptq-pro-4bit&lt;/span&gt;
    &lt;span class="na"&gt;base_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://vllm-qwen36.mylabdomain.com/v1&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;redacted&amp;gt;&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1800&lt;/span&gt;
    &lt;span class="na"&gt;extra_body&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;chat_template_kwargs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;enable_thinking&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;display&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;streaming&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;show_reasoning&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;interim_assistant_messages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;tool_progress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;

&lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_parallel_jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cache Discipline
&lt;/h2&gt;

&lt;p&gt;This part is mostly orchestration, not a vLLM flag.&lt;/p&gt;

&lt;p&gt;What I can control: stable cron prompts, stable skills, no child-agent swarm, no parallel debug/dev jobs, long timeouts, and one main request at a time.&lt;/p&gt;

&lt;p&gt;What I cannot fully control: the exact way Hermes serializes every internal prompt. If volatile session state or auxiliary material lands before stable instructions, prefix reuse will suffer.&lt;/p&gt;

&lt;p&gt;So the practical rule is simpler: make the inputs you control boring and repeatable, and avoid side requests competing with the main session.&lt;/p&gt;

&lt;p&gt;vLLM prefix caching helps, but it is not a magic persistent cache database. Treat it as an in-memory serving optimization and shape your traffic accordingly.&lt;/p&gt;
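
&lt;p&gt;A toy Python sketch of why prompt ordering matters for prefix reuse. It assumes the cache reuses only whole blocks shared from the very start of the prompt, and that a block holds 16 tokens; both are assumptions you should check against your own vLLM version. Whitespace-split words stand in for real tokens, and all names here are hypothetical.&lt;/p&gt;

```python
from typing import List

BLOCK_SIZE = 16  # assumed cache block size in tokens


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Count leading positions where both token lists agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_tokens(prev: List[str], new: List[str],
                    block_size: int = BLOCK_SIZE) -> int:
    """Tokens in whole blocks that both prompts share from the start."""
    return (shared_prefix_len(prev, new) // block_size) * block_size


stable = ("system rules " * 40).split()   # 80 "tokens" that never change
history = ("tool output " * 20).split()   # 40 "tokens" from earlier turns
volatile_1 = ["clock", "12:00"]           # session state at turn 1
volatile_2 = ["clock", "12:01"]           # session state at turn 2

# Stable-first ordering: volatile state goes last.
good_prev = stable + history + volatile_1
good_new = stable + history + volatile_2

# Volatile-first ordering: the first block already differs.
bad_prev = volatile_1 + stable + history
bad_new = volatile_2 + stable + history

print(reusable_tokens(good_prev, good_new))
print(reusable_tokens(bad_prev, bad_new))
```

&lt;p&gt;With the stable material first, almost the whole prompt is reusable; with volatile state first, nothing is, even though the two prompts differ by a single word.&lt;/p&gt;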

&lt;h2&gt;
  
  
  Expected Behavior
&lt;/h2&gt;

&lt;p&gt;Healthy run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;low TTFT when prefix reuse hits&lt;/li&gt;
&lt;li&gt;60-90s TTFT can still happen on large context transitions&lt;/li&gt;
&lt;li&gt;decode around high-30s tok/s on my setup&lt;/li&gt;
&lt;li&gt;stable tool use&lt;/li&gt;
&lt;li&gt;no recurring full-prefill on every continuation&lt;/li&gt;
&lt;li&gt;no child-agent swarm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bad run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;repeated full-prefill on every request&lt;/li&gt;
&lt;li&gt;auxiliary requests firing while the main agent waits&lt;/li&gt;
&lt;li&gt;model reaches output cap before final answer&lt;/li&gt;
&lt;li&gt;MTP instability/OOM&lt;/li&gt;
&lt;li&gt;tool loops caused by sampling/config/model mismatch&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Performance Snapshot
&lt;/h2&gt;

&lt;p&gt;This is not a formal benchmark, just a sanity check against the real OpenAI-compatible endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Small prompt:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;prompt: 41 tokens&lt;/li&gt;
&lt;li&gt;output: 384 tokens&lt;/li&gt;
&lt;li&gt;TTFT: 0.29s&lt;/li&gt;
&lt;li&gt;decode: 45.3 tok/s&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Large prompt, cold-ish prefix:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;prompt: 41,985 tokens&lt;/li&gt;
&lt;li&gt;output: 384 tokens&lt;/li&gt;
&lt;li&gt;TTFT: 38.6s&lt;/li&gt;
&lt;li&gt;decode: 41.8 tok/s&lt;/li&gt;
&lt;li&gt;effective prefill: ~1,087 prompt tok/s&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Same large prompt immediately repeated:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;prompt: 41,985 tokens&lt;/li&gt;
&lt;li&gt;output: 384 tokens&lt;/li&gt;
&lt;li&gt;TTFT: 1.59s&lt;/li&gt;
&lt;li&gt;decode: 42.1 tok/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last result is the important one for agent work: on the repeated prompt, TTFT dropped from 38.6s to 1.59s. Prefix caching does not make the model "faster" in the abstract; it makes repeated long-context continuations stop paying the full prefill cost when the prefix remains stable.&lt;/p&gt;
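
&lt;p&gt;The arithmetic behind the snapshot, as a small Python sketch. It treats the whole TTFT as prefill time, which is an approximation (queueing and the first decode step are ignored); the function name is hypothetical and the numbers are the ones reported above.&lt;/p&gt;

```python
def prefill_rate(prompt_tokens: int, ttft_s: float) -> float:
    """Effective prompt tokens per second, treating TTFT as pure prefill time."""
    return prompt_tokens / ttft_s


prompt_tokens = 41985   # large prompt from the snapshot
cold_ttft_s = 38.6      # cold-ish prefix
warm_ttft_s = 1.59      # same prompt immediately repeated

cold_rate = prefill_rate(prompt_tokens, cold_ttft_s)
ttft_speedup = cold_ttft_s / warm_ttft_s
saved_s = cold_ttft_s - warm_ttft_s

print(f"cold prefill: {cold_rate:.1f} prompt tok/s")
print(f"TTFT on repeat: {ttft_speedup:.1f}x lower, {saved_s:.2f}s saved")
```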

&lt;h2&gt;
  
  
  My Practical Takeaway
&lt;/h2&gt;

&lt;p&gt;For this workload, the model was not the blocker. Qwen3.6-27B is strong enough to be useful locally as a coding agent.&lt;/p&gt;

&lt;p&gt;The hard part is serving discipline: context size, request sequencing, prefix reuse, timeout policy and avoiding self-inflicted concurrency.&lt;/p&gt;

&lt;p&gt;If you only test "does the model answer one prompt?", you are testing the wrong thing.&lt;/p&gt;

&lt;p&gt;Test the loop.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>vllm</category>
      <category>agents</category>
    </item>
    <item>
      <title>I Stopped Chasing MTP TPS and Got a Local 27B Agent That Actually Stayed Usable on 24GB VRAM</title>
      <dc:creator>Xavier Rey-Robert</dc:creator>
      <pubDate>Fri, 19 Jun 2026 17:26:05 +0000</pubDate>
      <link>https://dev.to/xreyrobertibm/i-stopped-chasing-mtp-tps-and-got-a-local-27b-agent-that-actually-stayed-usable-on-24gb-vram-5897</link>
      <guid>https://dev.to/xreyrobertibm/i-stopped-chasing-mtp-tps-and-got-a-local-27b-agent-that-actually-stayed-usable-on-24gb-vram-5897</guid>
      <description>&lt;p&gt;I was already happy with my &lt;a href="https://huggingface.co/groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit" rel="noopener noreferrer"&gt;groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit&lt;/a&gt; + vLLM + Hermes recipe: one local agent, one 24GB GPU, long context, tools, thinking enabled, and enough serving discipline that the session could keep working after hours of edits, terminal calls, retries, compression, and context growth.&lt;/p&gt;

&lt;p&gt;So when Jackrong released Qwopus3.6-27B-v2, I wanted to see if the same recipe would hold.&lt;/p&gt;

&lt;p&gt;I rented an A100, burned a few dollars on quantization, and published the result: &lt;a href="https://huggingface.co/XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1" rel="noopener noreferrer"&gt;XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The artifact
&lt;/h2&gt;

&lt;p&gt;It is a GPTQ-Pro 4-bit quantized derivative of &lt;a href="https://huggingface.co/Jackrong/Qwopus3.6-27B-v2" rel="noopener noreferrer"&gt;Jackrong/Qwopus3.6-27B-v2&lt;/a&gt;. The GPTQ-Pro recipe and much of the practical quantization know-how come from groxaxo, so credit where it is due: &lt;a href="https://github.com/groxaxo" rel="noopener noreferrer"&gt;https://github.com/groxaxo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal was usability: make this model practical in the local coding-agent setup I actually run.&lt;/p&gt;

&lt;p&gt;That means one local agent, one 24GB GPU, long context, tools, thinking enabled, and enough serving discipline that the session can keep working after hours of edits, terminal calls, retries, compression, and context growth.&lt;/p&gt;

&lt;p&gt;The target was not a short-prompt benchmark.&lt;/p&gt;

&lt;p&gt;It was not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Does it answer one prompt?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The real question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Does the loop hold?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For reference, the quantization shape is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- GPTQ-Pro / GPTQModel
- 4-bit
- group size 128
- 256 calibration samples
- 2048 calibration sequence length
- FOEM alpha 0.25 / beta 0.2
- vLLM + GPTQ-Marlin
- target: RTX 3090 / 24GB VRAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The serving shape
&lt;/h2&gt;

&lt;p&gt;I serve it text-only.&lt;/p&gt;

&lt;p&gt;No multimodal path in this setup. No speculative decoding. No parallel request pile-up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 vllm serve XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--served-model-name&lt;/span&gt; qwopus3.6-27b-gptq-pro-v1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--language-model-only&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dtype&lt;/span&gt; float16 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quantization&lt;/span&gt; gptq_marlin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 131072 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-num-seqs&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--kv-cache-dtype&lt;/span&gt; fp8_e5m2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning-parser&lt;/span&gt; qwen3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-auto-tool-choice&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tool-call-parser&lt;/span&gt; qwen3_coder &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-prefix-caching&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.95 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--trust-remote-code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;max_num_seqs=1&lt;/code&gt; is deliberate.&lt;/p&gt;

&lt;p&gt;On one 24GB card, parallelism is not free. Title generation, compression, retries, summaries, and side requests can all compete with the main agent request.&lt;/p&gt;

&lt;p&gt;I would rather have one useful request finish cleanly than two requests sabotaging each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why no speculative decoding?
&lt;/h2&gt;

&lt;p&gt;Because on this setup it did not improve the thing I care about: end-to-end long-context agent throughput.&lt;/p&gt;

&lt;p&gt;This artifact should be treated as non-MTP for vLLM speculative decoding. It keeps some MTP-related config metadata, but the published weight index does not contain actual &lt;code&gt;mtp.*&lt;/code&gt; tensors.&lt;/p&gt;

&lt;p&gt;I also tested a follow-up artifact with real MTP tensors restored and the large MTP linears quantized. Draft acceptance was real, but on a single RTX 3090 it was still slower than the non-MTP baseline for the useful 100k-131k context range.&lt;/p&gt;

&lt;p&gt;For this workload, MTP adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;memory pressure&lt;/li&gt;
&lt;li&gt;serving complexity&lt;/li&gt;
&lt;li&gt;no end-to-end speedup on 1x3090&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On larger GPUs or short-prompt workloads, speculative decoding may be worth revisiting.&lt;/p&gt;

&lt;p&gt;For a 24GB long-context coding agent, I leave it off until proven otherwise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token/s is not the whole story
&lt;/h2&gt;

&lt;p&gt;The useful question is whether the agent keeps the prefix hot and avoids paying full prefill again and again.&lt;/p&gt;

&lt;p&gt;Healthy behavior looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;large context still works&lt;/li&gt;
&lt;li&gt;prefix-cache hits are common&lt;/li&gt;
&lt;li&gt;TTFT drops when the prefix is reused&lt;/li&gt;
&lt;li&gt;tool calls stay stable&lt;/li&gt;
&lt;li&gt;the main request is not fighting side jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Observed on my 3090-class setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;average prompt: ~33k tokens&lt;/li&gt;
&lt;li&gt;average TTFT: ~5.7s&lt;/li&gt;
&lt;li&gt;prefill throughput: ~1917 tok/s&lt;/li&gt;
&lt;li&gt;decode estimate: ~43 tok/s&lt;/li&gt;
&lt;li&gt;prefix cache hit ratio: ~83%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the metric cluster I care about.&lt;/p&gt;

&lt;p&gt;Not "hello world" speed.&lt;/p&gt;

&lt;p&gt;Repeated long-context continuations.&lt;/p&gt;
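
&lt;p&gt;If you want to derive a prefix-cache hit ratio yourself, a minimal Python sketch is to divide two counters from the Prometheus text that vLLM serves on &lt;code&gt;/metrics&lt;/code&gt;. The metric names and the sample values below are assumptions chosen for illustration (synthetic, not my run); check the names your vLLM version actually exposes. The parser also assumes samples carry no trailing timestamp.&lt;/p&gt;

```python
# Synthetic sample in Prometheus text format; metric names are assumptions.
SAMPLE_METRICS = """\
# HELP vllm:prefix_cache_queries_total Prefix cache queries, in tokens.
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="demo"} 33000.0
# HELP vllm:prefix_cache_hits_total Prefix cache hits, in tokens.
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="demo"} 27390.0
"""


def sum_metric(text: str, name: str) -> float:
    """Sum all samples of one metric across its label sets."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        sample_name, _, value = line.rpartition(" ")
        # Drop the {label="..."} part, if any, to get the bare metric name.
        bare_name = sample_name.split("{", 1)[0]
        if bare_name == name:
            total += float(value)
    return total


def hit_ratio(text: str, queries_name: str, hits_name: str) -> float:
    """Hits divided by queries; 0.0 when nothing has been queried yet."""
    queries = sum_metric(text, queries_name)
    hits = sum_metric(text, hits_name)
    return hits / queries if queries else 0.0


ratio = hit_ratio(SAMPLE_METRICS,
                  "vllm:prefix_cache_queries_total",
                  "vllm:prefix_cache_hits_total")
print(f"prefix cache hit ratio: {ratio:.0%}")
```

&lt;p&gt;Counters are cumulative, so for a ratio over a time window you would scrape twice and apply the same division to the differences.&lt;/p&gt;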

&lt;h2&gt;
  
  
  What the vLLM run showed
&lt;/h2&gt;

&lt;p&gt;The more interesting evidence is not a single decode number.&lt;/p&gt;

&lt;p&gt;It is the shape of the vLLM metrics over a long agent-style run.&lt;/p&gt;

&lt;p&gt;In my 12-hour terminal-bench-style run, the endpoint was not just answering tiny prompts. It was handling repeated tasks, retained context, tool calls, longer generations, fresh starts, retries, and compression.&lt;/p&gt;

&lt;p&gt;That is much closer to how a coding agent actually behaves.&lt;/p&gt;

&lt;p&gt;The useful signals were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queue time stayed low&lt;/li&gt;
&lt;li&gt;prefix-cache reuse recovered after task changes&lt;/li&gt;
&lt;li&gt;finish reasons were mostly normal stops&lt;/li&gt;
&lt;li&gt;length caps and errors were not dominating&lt;/li&gt;
&lt;li&gt;tool-call behavior stayed stable under long-context pressure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The decode number is only half the story. For long-context agents, prefill throughput matters just as much, because every cold or partially cold prompt has to pay that cost before useful generation starts. A decent decode rate with terrible prefill still feels bad. A setup with good prefix reuse and healthy prefill throughput is what makes repeated long-context continuations tolerable.&lt;/p&gt;

&lt;p&gt;The panels I would show are the ones that explain the loop, not just speed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FHKzipYBWEAAP8M4%3Fformat%3Djpg%26name%3Dlarge" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FHKzipYBWEAAP8M4%3Fformat%3Djpg%26name%3Dlarge" alt="Prompt and generation throughput during the long run" width="1024" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prompt and generation throughput during the long run. The ~43 tok/s generation rate is only half the story: prompt throughput is what determines how painful long-context prefill is when the cache is cold or only partially reusable. A usable local agent needs both decent decode speed and tolerable prefill behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FHKzm_2EWcAAlsD0%3Fformat%3Djpg%26name%3Dlarge" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FHKzm_2EWcAAlsD0%3Fformat%3Djpg%26name%3Dlarge" alt="Finish reasons during the long run" width="1028" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finish reasons during the long run. The useful signal is that most requests end with normal stop reasons, not length caps or errors. For an agent loop, this matters as much as token throughput.&lt;/p&gt;

&lt;p&gt;The prefix-cache graph is the one I care about most.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FHKznocVWoAE9Fhz%3Fformat%3Djpg%26name%3Dlarge" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FHKznocVWoAE9Fhz%3Fformat%3Djpg%26name%3Dlarge" alt="Grafana Prefix Cache Hit Rate panel over a long vLLM run" width="1030" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prefix-cache hit rate over the long run. Drops are expected when task shape changes; the useful signal is that cache reuse returns when locality stabilizes.&lt;/p&gt;

&lt;p&gt;A new task naturally breaks locality. That is fine. The important part is that when the prompt shape stabilizes again, prefix reuse comes back.&lt;/p&gt;

&lt;p&gt;That is the difference between a local agent that keeps working and one that keeps paying full prefill until the session becomes painful.&lt;/p&gt;

&lt;h2&gt;
  
  
  TTFT is part of the serving contract
&lt;/h2&gt;

&lt;p&gt;At 100k+ context, TTFT is not always tiny.&lt;/p&gt;

&lt;p&gt;That does not automatically mean the model is slow or broken. Sometimes the server is doing real prefill work. If the prefix is cached, TTFT drops. If the task shape changes, the cache is colder and the server has to pay the cost again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FHKznthMWUAAGtlk%3Fformat%3Djpg%26name%3Dlarge" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FHKznthMWUAAGtlk%3Fformat%3Djpg%26name%3Dlarge" alt="Grafana Time To First Token Latency panel for a long vLLM run" width="1030" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TTFT during long-context agent traffic. Spikes are not automatically failures; they often mean the server is doing real prefill work after a colder prompt transition.&lt;/p&gt;

&lt;p&gt;This is why short client timeouts are toxic for local long-context agents.&lt;/p&gt;

&lt;p&gt;For this setup, long provider and gateway timeouts are not cosmetic. They are part of making the agent loop reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why FP8 KV matters
&lt;/h2&gt;

&lt;p&gt;At this context length, the weights are only part of the memory story.&lt;/p&gt;

&lt;p&gt;The KV cache becomes the constraint.&lt;/p&gt;

&lt;p&gt;That is why I use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--kv-cache-dtype fp8_e5m2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I do not treat FP8 KV as magic. It is a practical tradeoff that helps make the long-context setup fit on a 24GB card.&lt;/p&gt;
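
&lt;p&gt;The back-of-the-envelope arithmetic looks like this. The layer count, KV head count and head dimension are placeholders chosen to give round numbers, not the real dimensions of this model, and a hybrid-attention architecture would change the formula.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Keys plus values for every attention layer, for one sequence."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3
# Placeholder architecture, NOT the real model dimensions.
shape = dict(tokens=131_072, layers=48, kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(bytes_per_value=2, **shape)
fp8 = kv_cache_bytes(bytes_per_value=1, **shape)
print(fp16 / GIB, fp8 / GIB)  # 24.0 12.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With these made-up dimensions, a 16-bit KV cache at 131k tokens would not fit on a 24GB card even before the weights, while an 8-bit one takes half of that.&lt;/p&gt;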

&lt;h2&gt;
  
  
  Practical takeaway
&lt;/h2&gt;

&lt;p&gt;The working setup is the combination:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qwopus3.6-27B GPTQ-Pro&lt;/li&gt;
&lt;li&gt;vLLM GPTQ-Marlin&lt;/li&gt;
&lt;li&gt;text-only serving&lt;/li&gt;
&lt;li&gt;131k context&lt;/li&gt;
&lt;li&gt;FP8 KV cache&lt;/li&gt;
&lt;li&gt;prefix caching&lt;/li&gt;
&lt;li&gt;max_num_seqs=1&lt;/li&gt;
&lt;li&gt;thinking enabled&lt;/li&gt;
&lt;li&gt;long timeouts&lt;/li&gt;
&lt;li&gt;no speculative decoding&lt;/li&gt;
&lt;li&gt;no child-agent swarm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you only test one prompt, you are testing the wrong thing.&lt;/p&gt;

&lt;p&gt;For coding agents, the real test is the loop.&lt;/p&gt;

&lt;p&gt;Anyone else running similar single-GPU 24GB agent loops? Curious what tricks worked for you on prefix caching or KV cache.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/@vllm_project" rel="noopener noreferrer"&gt;@vllm_project&lt;/a&gt; &lt;a href="https://x.com/@NousResearch" rel="noopener noreferrer"&gt;@NousResearch&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>vllm</category>
      <category>agents</category>
    </item>
    <item>
      <title>Monitoring HP HBA H240 with telegraf and grafana</title>
      <dc:creator>Xavier Rey-Robert</dc:creator>
      <pubDate>Wed, 28 Feb 2024 21:58:17 +0000</pubDate>
      <link>https://dev.to/xreyrobertibm/monitoring-hp-hba-h240-with-telegraf-and-grafana-4f0c</link>
      <guid>https://dev.to/xreyrobertibm/monitoring-hp-hba-h240-with-telegraf-and-grafana-4f0c</guid>
      <description>&lt;p&gt;I've recently got an HP SAS HBA H240 for my home lab to manage eight SAS SSD Pm1633a drives for better IOPs - who doesn't need that to run OpenShift at home right ? Given the HBA240's tendency to heat up, especially in a workstation setup, it's important to keep an eye on temperatures (Controller and SSDs).&lt;/p&gt;

&lt;p&gt;To tackle this, I wrote a simple Python script that parses SSA CLI output into JSON format. This makes it easy to feed the data into Telegraf, enabling straightforward monitoring with Grafana.&lt;/p&gt;
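
&lt;p&gt;The idea, as a minimal sketch rather than the actual script from the gist: turn "Key: value" lines into a flat JSON object that Telegraf can ingest. The sample text below is an assumed layout; real SSA CLI output may differ.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json


def parse_key_values(text):
    """Turn 'Key: value' lines into a dict, converting integers where possible."""
    data = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if not sep:
            continue  # skip headings and blank lines
        value = value.strip()
        data[key] = int(value) if value.isdigit() else value
    return data


# Assumed sample; real SSA CLI output may be laid out differently.
sample = """
Smart HBA H240 in Slot 1
   Controller Status: OK
   Controller Temperature (C): 54
"""
print(json.dumps(parse_key_values(sample)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;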

&lt;p&gt;Just a quick post to share this, and for posterity...&lt;br&gt;
I won't get into Telegraf and Grafana here; just comment if you want the Telegraf conf or the Grafana panel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gist.github.com/XReyRobert/f3d6177d2b50b4198ea9f8896437c5b8" rel="noopener noreferrer"&gt;https://gist.github.com/XReyRobert/f3d6177d2b50b4198ea9f8896437c5b8&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Effortlessly Exporting and Importing Podman Volumes Across Hosts</title>
      <dc:creator>Xavier Rey-Robert</dc:creator>
      <pubDate>Sun, 18 Feb 2024 12:18:35 +0000</pubDate>
      <link>https://dev.to/xreyrobertibm/effortlessly-exporting-and-importing-podman-volumes-across-hosts-2ph4</link>
      <guid>https://dev.to/xreyrobertibm/effortlessly-exporting-and-importing-podman-volumes-across-hosts-2ph4</guid>
      <description>&lt;h2&gt;
  
  
  Effortlessly Exporting and Importing Podman Volumes Across Hosts
&lt;/h2&gt;

&lt;p&gt;Hey folks, let's tackle a common hiccup in managing Podman volumes remotely. If you've tried using &lt;code&gt;podman volume export&lt;/code&gt; with a remote Podman client, you've likely noticed it's not directly supported. But, I've crafted a workaround that simplifies the process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2yygppv2xavzo9e7hld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2yygppv2xavzo9e7hld.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge
&lt;/h3&gt;

&lt;p&gt;Working remotely and need to move a Podman volume from one server to another? You'll quickly find that &lt;code&gt;podman volume export&lt;/code&gt; isn't designed for remote client operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Workaround
&lt;/h3&gt;

&lt;p&gt;The solution lies in two Bash scripts that utilize SSH, SCP, and Podman's capabilities to facilitate volume export and import across remote hosts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exporting Volumes Made Simple
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;podman_remote_volume_export.sh&lt;/code&gt; script connects to your remote host via SSH, exports the specified Podman volume to a tarball, and then SCPs this tarball back to your local machine. It's a straightforward way to get your volume data where you need it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Importing Is Just as Easy
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;podman_remote_volume_import.sh&lt;/code&gt; script then takes over, uploading the exported tarball to a different remote host. It checks for existing volumes (offering an option to overwrite) and imports the volume data efficiently.&lt;/p&gt;
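
&lt;p&gt;For readers who want the flow at a glance, here is a rough Python sketch of the commands involved. It is not the Bash scripts from the gist: it only builds the command lines (it does not run them), it skips the overwrite prompt, and the &lt;code&gt;/tmp&lt;/code&gt; tarball path is an assumption.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import shlex


def export_commands(host, volume, tarball):
    """Commands to export a volume on a remote host and copy the tarball back."""
    remote = f"podman volume export {shlex.quote(volume)} --output {shlex.quote(tarball)}"
    return [
        ["ssh", host, remote],
        ["scp", f"{host}:{tarball}", "."],
        ["ssh", host, f"rm -f {shlex.quote(tarball)}"],  # clean up remotely
    ]


def import_commands(host, volume, tarball):
    """Commands to upload a tarball and import it into a volume on a remote host."""
    remote = f"podman volume import {shlex.quote(volume)} {shlex.quote(tarball)}"
    return [
        ["scp", tarball, f"{host}:{tarball}"],
        ["ssh", host, f"podman volume create {shlex.quote(volume)}"],
        ["ssh", host, remote],
        ["ssh", host, f"rm -f {shlex.quote(tarball)}"],  # clean up remotely
    ]


# Hypothetical host, volume and path, printed for inspection only.
for cmd in export_commands("user@remotehost", "volume_name", "/tmp/volume_name.tar"):
    print(shlex.join(cmd))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;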

&lt;h3&gt;
  
  
  A Few Considerations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Safety Checks&lt;/strong&gt;: To prevent accidental data loss, there's a prompt for confirmation before overwriting existing volumes during the import process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean as We Go&lt;/strong&gt;: Both scripts clean up after themselves, removing temporary tarballs to keep your hosts tidy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Usage
&lt;/h3&gt;

&lt;p&gt;To run these scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./podman_remote_volume_export.sh user@remotehost volume_name
./podman_remote_volume_import.sh user@remotehost new_volume_name /path/to/archive.tar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Bottom Line
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;podman volume export&lt;/code&gt; limitation for remote operations can be circumvented with these scripts, streamlining the process of migrating volumes. Designed for developers familiar with container management, they offer a practical solution to a common problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  CODE
&lt;/h3&gt;

&lt;p&gt;For a closer look and potential customization, the scripts are available on Gist: &lt;a href="https://gist.github.com/XReyRobert/aaec1a69eb38f54d869c6b5447babb20" rel="noopener noreferrer"&gt;Podman Volume Management Scripts Gist&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Happy containerizing!&lt;/p&gt;

&lt;p&gt;Edit 02/19/24:&lt;br&gt;
Just added podman_remote_volume_migrate.sh to do one or multiple volumes in one shot...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman_remote_volume_migrate.sh
Usage: /Volumes/UsersData_Macos/Users/xav/scripts/podman_remote_volume_migrate.sh &amp;lt;SOURCE_HOST&amp;gt; &amp;lt;DESTINATION_HOST&amp;gt; &amp;lt;VOLUME_NAME&amp;gt;...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>Quick hack to use multiple instances of Newtek NDI Scan Converter on MacOS</title>
      <dc:creator>Xavier Rey-Robert</dc:creator>
      <pubDate>Sat, 29 Aug 2020 18:32:54 +0000</pubDate>
      <link>https://dev.to/xreyrobertibm/quick-hack-to-use-multiple-instances-of-newtek-ndi-scan-converter-on-macos-10eb</link>
      <guid>https://dev.to/xreyrobertibm/quick-hack-to-use-multiple-instances-of-newtek-ndi-scan-converter-on-macos-10eb</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp1fq62c7dj18kkrj9dkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp1fq62c7dj18kkrj9dkg.png" alt="Alt Text" width="800" height="728"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm prepping for some upcoming education sessions and ran into an issue: I need multiple NDI video streams out of my Mac applications, so I needed a way to &lt;strong&gt;overcome the NewTek NDI Scan Converter app's limitation to a single feed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you stream live from your PC, for Videoconferencing, teaching or gaming, you might already know the free NDI tools from NewTek and the great addition they can be to OBS.&lt;/p&gt;

&lt;p&gt;Get the NewTek NDI Tools: &lt;a href="https://ndi.tv/tools/" rel="noopener noreferrer"&gt;Download here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This quick post shows how to run multiple instances of NewTek Scan Converter on a Mac. You can then have multiple apps broadcast as multiple NDI streams. In the screenshot above you can see 3 NDI feeds, one from an iPhone Cam, one from a Terminal and one from 3D Heavens benchmark, all displayed at once in OBS.&lt;/p&gt;

&lt;p&gt;While it's easy to spawn more than one &lt;em&gt;Scan Converter&lt;/em&gt; app (with &lt;code&gt;open -n&lt;/code&gt; on the command line), the NDI stream name is hard-coded to "Scan Converter", so the outputs of the two instances conflict and only one shows up.&lt;/p&gt;

&lt;p&gt;So I came up with the following procedure to make things work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate the &lt;em&gt;NewTek NDI Scan Converter&lt;/em&gt; app and rename it to whatever you like (below, I use &lt;em&gt;Hacked Scan Converter&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You will need a binary editor; you can get &lt;a href="https://ridiculousfish.com/hexfiend/" rel="noopener noreferrer"&gt;Hex Fiend&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open the app package and look for the app binary: &lt;em&gt;-&amp;gt;Contents-&amp;gt;MacOS-&amp;gt;NewTek NDI Scan Converter&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Open it with Hex Fiend and search for the hex sequence "53 63 61 6E 20 43 6F 6E 76 65 72 74 65 72 00 61 70 70 6C 69 63 61 74 69 6F 6E 4E 61 6D 65" (that is, "Scan Converter", a null byte, then "applicationName"). This is the string that is used for the NDI stream name.&lt;/li&gt;
&lt;li&gt;Change &lt;em&gt;Scan Converter&lt;/em&gt; to something like &lt;em&gt;Hacked Scan 01&lt;/em&gt; (it has to be &lt;strong&gt;exactly&lt;/strong&gt; the same length, 14 characters)&lt;/li&gt;
&lt;/ul&gt;
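
&lt;p&gt;If you would rather script the edit than do it by hand, the same-length replacement can be sketched in Python. This is an illustration under assumptions, not something I ran against the real app: the demo bytes stand in for the binary, and a real run would read and write the file inside the duplicated app bundle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def patch_stream_name(blob, new_name, old_name="Scan Converter"):
    """Replace the NDI stream name in a binary blob without changing its size."""
    old = old_name.encode("ascii")
    new = new_name.encode("ascii")
    if len(new) != len(old):
        raise ValueError("replacement must have exactly the same length")
    # Match the name together with the marker that follows it in the binary.
    marker = b"\x00applicationName"
    if blob.count(old + marker) != 1:
        raise ValueError("expected exactly one occurrence of the pattern")
    return blob.replace(old + marker, new + marker)


# Stand-in bytes for demonstration; a real run would read the app binary.
demo = b"\x01\x02Scan Converter\x00applicationName\x03"
patched = patch_stream_name(demo, "Hacked Scan 01")
print(patched)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The length check matters: changing the size of the string would shift everything after it in the binary.&lt;/p&gt;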

&lt;p&gt;As we've modified the app, the app signature is now invalid so we'll just get rid of it with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;codesign --remove-signature '/Applications/Hacked Scan Converter.app'&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we have to change the bundle info so that the two apps won't interfere with each other in the security settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Edit &lt;em&gt;/Applications/Hacked Scan Converter.app/Contents/Info.plist&lt;/em&gt; and change the &lt;em&gt;CFBundleName&lt;/em&gt; and &lt;em&gt;CFBundleIdentifier&lt;/em&gt; values to reflect the new name:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;key&amp;gt;CFBundleName&amp;lt;/key&amp;gt;
&amp;lt;string&amp;gt;Hacked NDI Scan Converter&amp;lt;/string&amp;gt;
&amp;lt;key&amp;gt;CFBundleIdentifier&amp;lt;/key&amp;gt;
&amp;lt;string&amp;gt;com.hacked.Application-Mac-NDI-ScanConverter&amp;lt;/string&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Start both apps and make sure they have the right permissions in &lt;em&gt;Security-&amp;gt;Privacy-&amp;gt;Screen Recording&lt;/em&gt; settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should now see &lt;em&gt;Scan Converter&lt;/em&gt; and &lt;em&gt;Hacked Scan 01&lt;/em&gt; NDI sources available in NDI Monitor or other NDI apps. &lt;/p&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;

&lt;p&gt;You can repeat the steps above with different names if you need more than two NDI Scan Converter streams simultaneously...&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Gigabyte GA-X79-UP4 rev 1.1 with Xeon E5 2697v2 - 12 cores 24 threads</title>
      <dc:creator>Xavier Rey-Robert</dc:creator>
      <pubDate>Tue, 04 Aug 2020 22:02:39 +0000</pubDate>
      <link>https://dev.to/xreyrobertibm/gigabyte-ga-x79-up4-with-xeon-e5-2697-v2-12-cores-24-threads-297i</link>
      <guid>https://dev.to/xreyrobertibm/gigabyte-ga-x79-up4-with-xeon-e5-2697-v2-12-cores-24-threads-297i</guid>
      <description>&lt;p&gt;This is a short post for people with &lt;a href="https://www.gigabyte.com/Motherboard/GA-X79-UP4-rev-11#ov" rel="noopener noreferrer"&gt;Gigabyte X79-UP4&lt;/a&gt; wondering if they can upgrade their CPU to a 12 cores &lt;a href="https://ark.intel.com/content/www/us/en/ark/products/75283/intel-xeon-processor-e5-2697-v2-30m-cache-2-70-ghz.html" rel="noopener noreferrer"&gt;Xeon E5 2697v2&lt;/a&gt;. It's probably not interesting for the rest of the world! As I found absolutely no success story online with this motherboard/cpu combination, I drop it here for the archives :)&lt;/p&gt;

&lt;p&gt;In 2014, I built myself a decent setup with an X79-UP4 and a 6-core &lt;strong&gt;i7-4930K&lt;/strong&gt; CPU. Six years later, in 2020, it is still a very nice workhorse and doesn't pale in comparison to more modern setups. As an example, the 2018 6-core i7 MacBook Pro I'm using for work is far below it in terms of performance under load (mainly due to thermal throttling). The 4930K is still a well-regarded CPU, and with (simple) water cooling I can easily overclock it to run all 6 cores at 4.3 GHz.&lt;/p&gt;

&lt;p&gt;Just recently, after upgrading my GPU to a &lt;a href="https://www.amd.com/en/products/graphics/amd-radeon-rx-5700-xt" rel="noopener noreferrer"&gt;Radeon RX 5700 XT&lt;/a&gt; and adding 32GB of memory, I started to wonder if I could get a better CPU for my setup. So I started looking at the Xeon E5 line. &lt;/p&gt;

&lt;p&gt;When I picked the &lt;a href="https://ark.intel.com/content/www/us/en/ark/products/77780/intel-core-i7-4930k-processor-12m-cache-up-to-3-90-ghz.html" rel="noopener noreferrer"&gt;i7 4930k&lt;/a&gt; in 2014, it was priced at $600, but at the top of the Ivy Bridge line stood the &lt;a href="https://ark.intel.com/content/www/us/en/ark/products/75283/intel-xeon-processor-e5-2697-v2-30m-cache-2-70-ghz.html" rel="noopener noreferrer"&gt;Xeon E5 2697-v2&lt;/a&gt; - 12 cores, 24 threads - for a bit less than $3000! It was the best CPU you could fit in the $6000+ &lt;a href="https://support.apple.com/kb/SP697?locale=fr_FR" rel="noopener noreferrer"&gt;2013 Mac Pro&lt;/a&gt; I was drooling over at the time.&lt;/p&gt;

&lt;p&gt;The E5 2697v2 is still sold new by Intel at $2000, but you can find used ones for much less. I picked mine up for $170 &lt;a href="https://fr.aliexpress.com/wholesale?catId=0&amp;amp;initiative_id=SB_20200804134853&amp;amp;SearchText=xeon+2697+v2" rel="noopener noreferrer"&gt;directly from China!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I checked Gigabyte's specifications, I realised that, unfortunately, &lt;strong&gt;the Xeon E5-2697v2 was not on the list of &lt;a href="https://www.gigabyte.com/Motherboard/GA-X79-UP4-rev-11/support#support-cpu" rel="noopener noreferrer"&gt;supported CPUs&lt;/a&gt;&lt;/strong&gt;. Strangely, all the other CPUs of the Ivy Bridge-E family are there, but not this one (and a few others). I contacted Gigabyte support and got an answer like &lt;em&gt;"If it's not on the list, it's not supported. We recommend using CPU's from the list"&lt;/em&gt;. Fine, but is it not supported because it was not tested, or because it was tested and doesn't work? 3 weeks later, the request is still open and not properly answered... Congrats, support.&lt;/p&gt;

&lt;p&gt;Well, 3 weeks was actually the time needed for the CPU to arrive from China. At this price I didn't wait and decided to take the risk and try it myself! I could see no reason why the whole family would work but not this one...&lt;/p&gt;

&lt;p&gt;And I was right! &lt;strong&gt;It works perfectly&lt;/strong&gt;! As I'm on summer vacation, I'm taking some time to tell the world about it, or at least to drop the information here in case someone else is googling along the same path. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0isewiiiy6swxy568ef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0isewiiiy6swxy568ef.png" alt="CPU-Z validation" width="403" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So is it really an upgrade from an overclocked 4930k? Hmm, not an easy answer.&lt;/p&gt;

&lt;p&gt;My Geekbench score for the overclocked &lt;strong&gt;4930k was 975 single core and 5884 multicore&lt;/strong&gt; (all 6 cores running at 4.3 GHz). The &lt;strong&gt;non-overclocked E5 2697v2 is a bit disappointing with scores of 678 single core and 6439 multicore&lt;/strong&gt;. That's roughly -30% and +9% respectively.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;E5 2697v2 - like most Xeons - is locked&lt;/strong&gt; and therefore not &lt;strong&gt;easily&lt;/strong&gt; overclockable. I started playing with the bus clock, which is the only way to squeeze a little more juice out of it, and I got honourable &lt;strong&gt;759 single core and 8480 multicore scores&lt;/strong&gt;. From the scores above, that is about -22% and +44% versus the overclocked 4930k (+12% and +32% versus the stock Xeon). But to make things worse, the 113 MHz bus clock boost led to some instability with my GPU...&lt;/p&gt;
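
&lt;p&gt;To double-check percentages like these, here is a small Python sketch that recomputes the relative changes from the Geekbench scores quoted in this post, using the overclocked 4930k as the baseline. Only the helper name is made up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100


# Geekbench scores quoted in this post: (single core, multicore).
i7_oc = (975, 5884)
xeon_stock = (678, 6439)
xeon_bclk = (759, 8480)

for label, scores in (("stock Xeon", xeon_stock), ("BCLK Xeon", xeon_bclk)):
    single = pct_change(scores[0], i7_oc[0])
    multi = pct_change(scores[1], i7_oc[1])
    print(f"{label}: {single:+.1f}% single, {multi:+.1f}% multi")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;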

&lt;p&gt;Of course, with a non-overclocked 4930k that would be a different story. To be fair, I had been using my 4930k at stock speed for 6 years, totally satisfied, and only started overclocking it a few weeks ago because I was going to receive the new CPU. It has been rock steady on overclock since then, though.&lt;/p&gt;

&lt;p&gt;On the temperature side, under standard/idle use (browsing/video) I reach 40° with the Xeon, where it was 60° with the overclocked 4930k. Under heavy load (Cinebench) I would reach 80° with the 4930k, and I top out at 60° with the Xeon...&lt;/p&gt;

&lt;p&gt;Using my system, I definitely cannot feel the -30% penalty on single core performance, and for some workloads it might still be a nice improvement. Compiling TensorFlow used to take about 1h; I will try and see how long it takes now.&lt;/p&gt;

&lt;p&gt;Oh but wait, I'm just reading that there is the &lt;strong&gt;Xeon E5 1680v2&lt;/strong&gt; - 8 cores, 16 threads - with one interesting particularity in the Xeon line... &lt;strong&gt;it is unlocked&lt;/strong&gt;, and I have the feeling that with this one, when overclocked, one could beat the single core performance of the 4930k and probably get close to the multicore performance of the 12-core E5 2697v2! &lt;a href="http://cpuboss.com/cpus/Intel-Xeon-E5-2697-v2-vs-Intel-Xeon-E5-1680-v2" rel="noopener noreferrer"&gt;See how close they are non overclocked&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So I guess I could sell my 4930k and order an E5 1680v2, just to try.... or just stop here and wait...&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Machine learning on macOs using Keras -&gt; Tensorflow (1.15.0) -&gt; nGraph -&gt; PlaidML -&gt; AMD GPU</title>
      <dc:creator>Xavier Rey-Robert</dc:creator>
      <pubDate>Thu, 23 Jul 2020 08:46:06 +0000</pubDate>
      <link>https://dev.to/xreyrobertibm/machine-learning-on-macos-using-keras-tensorflow-1-15-0-ngraph-plaidml-amd-gpu-l4j</link>
      <guid>https://dev.to/xreyrobertibm/machine-learning-on-macos-using-keras-tensorflow-1-15-0-ngraph-plaidml-amd-gpu-l4j</guid>
      <description>&lt;p&gt;Since the unavailability of Cuda on macOS, choices to use GPUs for Machine learning on Macs are sparse.&lt;/p&gt;

&lt;p&gt;After failing to find some practical ways to do it, I resorted to use a second Linux computer with an Nvidia GPU for training my networks.&lt;/p&gt;

&lt;p&gt;The availability of macOS Catalina with &lt;a href="https://support.apple.com/en-us/HT208544" rel="noopener noreferrer"&gt;Apple support for Navi AMD GPUs&lt;/a&gt; incited me to give it another try. This was quite tough so I decided to write it down to share the experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The easy way: Keras with &lt;a href="https://github.com/plaidml/plaidml" rel="noopener noreferrer"&gt;PlaidML&lt;/a&gt; - &lt;em&gt;No tensorflow involved&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;This is quite straightforward and I'm not going to cover it again here. You can check this article: &lt;a href="https://medium.com/@bamouh42/gpu-acceleration-on-amd-with-plaidml-for-training-and-using-keras-models-57a9fce883b9" rel="noopener noreferrer"&gt;https://medium.com/@bamouh42/gpu-acceleration-on-amd-with-plaidml-for-training-and-using-keras-models-57a9fce883b9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my case that was not satisfying. Here Keras is using PlaidML as a backend and I want to be able to use &lt;a href="https://github.com/keunwoochoi/kapre" rel="noopener noreferrer"&gt;Kapre&lt;/a&gt; which &lt;strong&gt;requires a tensorflow backend&lt;/strong&gt;. &lt;a href="https://github.com/keunwoochoi/kapre" rel="noopener noreferrer"&gt;Kapre&lt;/a&gt; is a neat library providing keras layers to calculate melspectrograms on the fly. &lt;/p&gt;

&lt;p&gt;Be aware that the &lt;a href="https://devclass.com/2019/09/18/another-one-bites-the-dust-keras-team-steps-away-from-multi-backends-refocuses-on-tf-keras/" rel="noopener noreferrer"&gt;Keras team is stepping away from multi-backends&lt;/a&gt;, so &lt;strong&gt;the Keras -&amp;gt; PlaidML&lt;/strong&gt; approach might be a dead end anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The journey to Tensorflow execution on mac GPUs / eGPUs
&lt;/h2&gt;

&lt;p&gt;The key element here is &lt;a href="https://github.com/NervanaSystems/ngraph" rel="noopener noreferrer"&gt;nGraph&lt;/a&gt;. Without going into details, nGraph takes a neutral approach, supporting multiple frameworks &lt;em&gt;(Tensorflow, ONNX, etc.)&lt;/em&gt; and multiple hardware targets &lt;em&gt;(Intel CPU, NNPs, etc)&lt;/em&gt;, and luckily for us (not so! just wait) nGraph was also integrated with PlaidML to offer support for GPUs &lt;em&gt;(Intel, Nvidia and... AMD)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0nk1wnyzxdamlsynnjqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0nk1wnyzxdamlsynnjqw.png" alt="ngraph-ecosystem" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So on paper all is great, we have a way to go: &lt;br&gt;
&lt;strong&gt;Keras -&amp;gt; Tensorflow -&amp;gt; nGraph -&amp;gt; nGraph-bridge -&amp;gt; PlaidML -&amp;gt; Metal -&amp;gt; AMD GPU&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this domain, as in others, things are moving fast. So fast that it's not always easy to keep pace, and the same goes for the teams behind those projects. There is a lot of software involved, and things are changing so fast that developers don't have time - or don't take the time - to settle things down. &lt;/p&gt;

&lt;p&gt;The nGraph-bridge team hasn't done a proper release since August 2019 (v0.18.1), and while they are still actively working on the project, they seem to have been focusing on a big refactoring. &lt;/p&gt;

&lt;p&gt;To make things worse, &lt;strong&gt;PlaidML support was (silently) dropped from nGraph&lt;/strong&gt; in April without much explanation or warning, so forget about using the latest GitHub master to try to sort it out! I spent hours wondering why it wasn't working when it was simply not there anymore.&lt;/p&gt;
&lt;h4&gt;
  
  
  Why was the PlaidML bridge dropped?
&lt;/h4&gt;

&lt;p&gt;It seems that the future &lt;em&gt;path to happiness&lt;/em&gt; will be &lt;strong&gt;Keras -&amp;gt; Tensorflow -&amp;gt; MLIR -&amp;gt; PlaidML -&amp;gt; ...&lt;/strong&gt; and everyone is preparing for the jump when MLIR is released as a Tensorflow backend ... &lt;strong&gt;in 2021!&lt;/strong&gt; But as of today, users are just left hanging in midair.&lt;/p&gt;
&lt;h4&gt;
  
  
  What are your options?
&lt;/h4&gt;

&lt;p&gt;At the time of writing, the latest release is ngraph-bridge v0.18.1 (dated 20 Aug 2019!). It uses tensorflow v1.14.0 - Argh! Kapre requires tensorflow v1.15 - Dead end again.&lt;/p&gt;

&lt;p&gt;I should mention that &lt;strong&gt;you are better off not using prebuilt wheels&lt;/strong&gt;. I realized that not all of them are compiled with PlaidML backend support. &lt;strong&gt;So your best chance is to build nGraph and nGraph-bridge from source&lt;/strong&gt;, and you'd better have all the stars aligned for that to happen flawlessly. A lot of things can go wrong: Python versions, bazel versions, library incompatibilities, bugs to fix in the code, etc... all the joys of Python.&lt;/p&gt;
&lt;h4&gt;
  
  
  Picking a release candidate to build
&lt;/h4&gt;

&lt;p&gt;v0.19.0-rc9 brings Tensorflow v1.15.0; the recommended last stable baseline (nGraph 0.28.0-rc1) is still on Tensorflow v1.14.0.&lt;/p&gt;

&lt;p&gt;I need TF 1.15, so let's try with &lt;strong&gt;v0.19.0-rc10&lt;/strong&gt; then... Of course the standard build crashes miserably, which leads me to think that this rc was probably never compiled or tested with PlaidML support on Mac: clang fails because of an incomplete switch statement in plaidml_translate.cpp.&lt;/p&gt;

&lt;p&gt;We will fix it by adding this line to the &lt;code&gt;switch(dt)&lt;/code&gt; in the &lt;code&gt;tile_converter&lt;/code&gt; function:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;case PLAIDML_DATA_BFLOAT16: return "as_bfloat16(" + tensor_name + ", 16)";&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;See the complete build instructions below.&lt;/p&gt;

&lt;p&gt;If everything goes right you should end up with something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TensorFlow version:  1.15.0
C Compiler version used in building TensorFlow:  4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)
nGraph bridge version: b'0.19.0-rc10'
nGraph version used for this build: b'0.25.1-rc.10+90c70dd'
TensorFlow version used for this build: v1.15.0-rc3-22-g590d6eef7e
CXX11_ABI flag used for this build: 0
nGraph bridge built with Grappler: False
nGraph bridge built with Variables and Optimizers Enablement: False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Final thoughts - Use at your own risk
&lt;/h4&gt;

&lt;p&gt;OK, we have a working environment, but there are so many interlocking (fresh) software bricks that we have no guarantee that all this will run properly in all circumstances.&lt;br&gt;
Using Kapre for example, I'm able to use the &lt;strong&gt;&lt;em&gt;mel_spectrogram&lt;/em&gt;&lt;/strong&gt; layer just fine, but ngraph-bridge&lt;br&gt;
will crash with &lt;em&gt;Caught exception while executing nGraph computation: syntax error&lt;/em&gt; when trying to use the STFT layer... &lt;/p&gt;

&lt;p&gt;I will not abandon my Linux deep learning workhorse quite yet, but at least I have an environment to try out that will use my MacBook Pro GPU on the go and my Catalina / AMD RX 5700 XT setup at home.&lt;/p&gt;
&lt;h4&gt;
  
  
  The complete build instructions
&lt;/h4&gt;

&lt;p&gt;Below is what worked for me - I retested on a fresh Mac after days of messing around -&lt;/p&gt;

&lt;p&gt;Make sure you have a proper python3 installation (I won't cover it). I'm using 3.7, installed and managed with &lt;code&gt;brew install python@3.7&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/tensorflow/ngraph-bridge.git
cd ngraph-bridge

git checkout v0.19.0-rc10

# Install bazel (bazelisk was a mess)
export BAZEL_VERSION=0.25.2 

curl -LO "https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-darwin-x86_64.sh"

chmod +x "bazel-${BAZEL_VERSION}-installer-darwin-x86_64.sh"
./bazel-${BAZEL_VERSION}-installer-darwin-x86_64.sh --user

source ~/.bazel/bin/bazel-complete.bash

# Add $HOME/bin to your PATH in .zshrc (or .bashrc) and source it

echo 'export PATH="$PATH:$HOME/bin"' &amp;gt;&amp;gt; ~/.zshrc
source ~/.zshrc

# check bazel 
bazel version

# I like to start with a fresh venv dedicated to the build

python3 -m venv build-venv
source build-venv/bin/activate

# Recommended virtualenv v16.0.0 didn't work, I ended up using latest version

python3 -m pip install virtualenv

#Install tensorflow from wheel (find the right one here: https://pypi.org/project/tensorflow/1.15.0/#files)

python3 -m pip install https://files.pythonhosted.org/packages/dc/65/a94519cd8b4fd61a7b002cb752bfc0c0e5faa25d1f43ec4f0a4705020126/tensorflow-1.15.0-cp37-cp37m-macosx_10_11_x86_64.whl

#start the build

python3 build_ngtf.py --use_prebuilt_tensorflow --build_plaidml_backend

# When the build fails edit plaidml_translate.cpp from ngraph to add the missing case 

vi build_cmake/ngraph/src/ngraph/runtime/plaidml/plaidml_translate.cpp

#re-start the build

python3 build_ngtf.py --use_prebuilt_tensorflow --build_plaidml_backend

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Some hints for the record:
&lt;/h4&gt;

&lt;p&gt;When installing Kapre you might run into&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AttributeError: module 'enum' has no attribute 'IntFlag'&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;This is solved by removing enum34 (version 1.1.10 in my case):&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip uninstall enum34&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;When importing Librosa, you might run into:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ModuleNotFoundError: No module named 'numba.decorators'&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;This is solved by using an older version of numba:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install numba==0.48&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

</description>
      <category>keras</category>
      <category>plaidml</category>
      <category>ngraph</category>
      <category>amd</category>
    </item>
  </channel>
</rss>
