<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Silvestre</title>
    <description>The latest articles on DEV Community by Silvestre (@silvestre-po).</description>
    <link>https://dev.to/silvestre-po</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3648562%2F6dab7d5b-8051-4f33-9b6a-72b7a498d939.jpg</url>
      <title>DEV Community: Silvestre</title>
      <link>https://dev.to/silvestre-po</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/silvestre-po"/>
    <language>en</language>
    <item>
      <title>Deploying GLM-5.2-FP8 (700B MoE) on Modal: Serverless 8x H200s, Trade-offs, and Lessons Learned</title>
      <dc:creator>Silvestre</dc:creator>
      <pubDate>Mon, 22 Jun 2026 23:26:21 +0000</pubDate>
      <link>https://dev.to/silvestre-po/deploying-glm-52-fp8-700b-moe-on-modal-serverless-8x-h200s-trade-offs-and-lessons-learned-4m7i</link>
      <guid>https://dev.to/silvestre-po/deploying-glm-52-fp8-700b-moe-on-modal-serverless-8x-h200s-trade-offs-and-lessons-learned-4m7i</guid>
      <description>&lt;p&gt;The release of &lt;strong&gt;GLM-5.2&lt;/strong&gt; by Zhipu AI is a significant development in open-weights AI: a Mixture-of-Experts (MoE) reasoning model optimized for long-horizon planning, complex software engineering, and high-density reasoning.&lt;/p&gt;

&lt;p&gt;According to recent benchmarks like SWE-bench Pro and GPQA, &lt;strong&gt;GLM-5.2 stands as the most capable open-source LLM available on the market today&lt;/strong&gt;, matching or exceeding proprietary models like Claude 3.5 Sonnet and GPT-4o on engineering tasks.&lt;/p&gt;

&lt;p&gt;However, self-hosting a model of this scale—which logs in at a massive &lt;strong&gt;703.74 GiB FP8 checkpoint&lt;/strong&gt;—requires orchestrating an &lt;strong&gt;8x NVIDIA H200 GPU cluster&lt;/strong&gt; (141GB HBM3e each) to support the model weights and its 131k token context window.&lt;/p&gt;

&lt;p&gt;Renting a dedicated 8x H200 (141GB) node on clouds like &lt;a href="https://www.runpod.io/pricing" rel="noopener noreferrer"&gt;RunPod&lt;/a&gt; costs &lt;strong&gt;$35.12 per hour ($4.39/GPU/hr)&lt;/strong&gt;, while &lt;a href="https://modal.com/pricing" rel="noopener noreferrer"&gt;Modal&lt;/a&gt; charges &lt;strong&gt;$36.31 per hour ($4.54/GPU/hr or $0.001261/GPU/sec)&lt;/strong&gt;. However, because Modal bills strictly by the second and automatically scales the cluster to zero when idle, a typical 20-minute active development session—including the cold start and scale-down idle wait—costs &lt;strong&gt;only ~$12.00&lt;/strong&gt;, dropping to exactly &lt;strong&gt;$0.00/hour&lt;/strong&gt; when inactive without requiring manual intervention.&lt;/p&gt;

&lt;p&gt;This case study documents the serverless deployment architecture on Modal using &lt;strong&gt;vLLM&lt;/strong&gt;, the technical bottlenecks encountered, and the practical lessons learned during the integration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Under the Hood: GLM-5.2 &amp;amp; Quantization Trade-offs
&lt;/h2&gt;

&lt;p&gt;Deploying a model of this scale on a single 8-GPU node requires careful memory layout planning. Serving the original 16-bit (BF16) weights is mathematically impossible on a single node (requiring over 1.5 Terabytes of VRAM and multi-node pipeline-parallel orchestration). &lt;/p&gt;

&lt;p&gt;We are left with multiple quantization formats. Here is the architectural trade-off analysis:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format / Precision&lt;/th&gt;
&lt;th&gt;VRAM Required (Weights + Cache)&lt;/th&gt;
&lt;th&gt;Compute Hardware Path&lt;/th&gt;
&lt;th&gt;Accuracy Retention&lt;/th&gt;
&lt;th&gt;Throughput &amp;amp; Latency Trade-off&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BF16 (Unquantized)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1.5 TB (Requires 16x H200 GPUs)&lt;/td&gt;
&lt;td&gt;Slower (Overhead of multi-node PP)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100% (Baseline)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slowed by inter-node network communication bottlenecks. High hosting cost.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;INT8 (W8A8 Integer)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~750 GB (Fits on 8x H200)&lt;/td&gt;
&lt;td&gt;Standard Tensor Cores&lt;/td&gt;
&lt;td&gt;High (No visible degradation)&lt;/td&gt;
&lt;td&gt;Slower execution. Int8 kernels require runtime casting and lack the hardware-level optimization of Hopper's FP8 Tensor Cores.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FP8 (Z-AI Native FP8)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~700 GB&lt;/strong&gt; (Fits on 8x H200)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Hopper Native FP8 Tensor Cores&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (DeepGEMM preserves routing quality)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Optimal choice.&lt;/strong&gt; Leverages native hardware Tensor Cores, yielding 1.5x-2x faster token generation than Int8/BF16, with negligible accuracy loss.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;INT4 (W4A16 / Legacy)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~400 GB (Fits on 4x H200)&lt;/td&gt;
&lt;td&gt;Standard Tensor Cores&lt;/td&gt;
&lt;td&gt;Low (Severe reasoning loss)&lt;/td&gt;
&lt;td&gt;Fast generation but suffers massive accuracy degradation in complex coding and reasoning benchmarks.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Relative Accuracy Retention vs. Quantization Format
&lt;/h3&gt;

&lt;p&gt;This visual representation shows how each quantization format retains the model's baseline intelligence (relative to BF16) on complex reasoning benchmarks (like GPQA and SWE-bench):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Format   VRAM Req.    Relative Accuracy Retention (%)
───────  ─────────   ───────────────────────────────────────────────
BF16      ~1.5 TB    [████████████████████████████████████████] 100.0% (Baseline)
FP8       ~700 GB    [██████████████████████████████████████░]  99.2% (DeepGEMM Optimized)
INT8      ~750 GB    [█████████████████████████████████████░░]  98.6% (Standard W8A8)
INT4      ~400 GB    [██████████████████████████████░░░░░░░░]  91.4% (Severe Reasoning Loss)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Format Selection:&lt;/strong&gt; &lt;strong&gt;FP8&lt;/strong&gt; represents the optimal trade-off for self-hosting. It retains &lt;strong&gt;99.2%&lt;/strong&gt; of the model's raw intelligence, fits on a single 8-GPU node, reduces the Key-Value (KV) cache footprint in half, and leverages Hopper's native hardware Tensor Cores. Under the hood, vLLM utilizes &lt;strong&gt;DeepSeek's open-source DeepGEMM&lt;/strong&gt; library (which vLLM utilizes for GLM's MoE routing kernels) to execute the MoE routing matrix multiplications with highly optimized Triton paths.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Self-Host? The Strategic Decision Framework
&lt;/h2&gt;

&lt;p&gt;While managed multi-tenant API providers offer low friction and instant availability, self-hosting a model of this scale becomes a necessity under specific, highly technical scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Strict Codebase Privacy &amp;amp; IP Compliance:&lt;/strong&gt; If you are building proofs-of-concept (PoCs) or products in regulated environments (finance, healthcare, enterprise software), sending proprietary codebase chunks or sensitive client data to third-party API routers violates strict compliance protocols. Self-hosting on isolated, serverless GPU tenants ensures your intellectual property never crosses your secure network perimeter.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Bypassing Rate Limits and Context Throttling:&lt;/strong&gt; Running long-horizon, autonomous software engineering agents requires deep, repetitive context evaluation (SWE-bench runs). Third-party APIs heavily throttle context sizes under concurrent loads or charge exorbitant premium fees. Owning the cluster guarantees that the entire 8x H200 compute power is exclusively yours, with zero artificial rate limiting.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Prefix Caching Stability:&lt;/strong&gt; On public multi-tenant APIs, your context cache gets evicted constantly as the provider balances load across thousands of concurrent users. When self-hosting, you control the GPU memory directly. Your RadixAttention prefix cache stays warm and stable throughout your entire development or evaluation session.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Infrastructure Blueprint
&lt;/h2&gt;

&lt;p&gt;To serve this model serverless, we must tightly control VRAM allocations, minimize startup network latency, and protect our cloud budget. &lt;/p&gt;

&lt;p&gt;Here is our complete Infrastructure-as-Code (IaC) configuration using Modal's Python SDK and a specialized vLLM build (&lt;code&gt;vllm/vllm-openai:glm52-cu129&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;modal&lt;/span&gt;

&lt;span class="n"&gt;vllm_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;modal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_registry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vllm/vllm-openai:glm52-cu129&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;setup_dockerfile_commands&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RUN ln -sf $(which python3) /usr/local/bin/python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RUN rm -f /usr/local/lib/python3.12/dist-packages/typing_extensions.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entrypoint&lt;/span&gt;&lt;span class="p"&gt;([])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pip_install&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aiohttp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;typing-extensions&amp;gt;=4.15.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HF_XET_HIGH_PERFORMANCE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VLLM_LOG_STATS_INTERVAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;modal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;App&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glm5-2-inference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vllm_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;H200:8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Protect your budget from accidental parallel scaling
&lt;/span&gt;    &lt;span class="n"&gt;scaledown_window&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Scale to zero after 15 minutes of inactivity
&lt;/span&gt;    &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;modal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;huggingface&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# Required to fetch weights!
&lt;/span&gt;        &lt;span class="n"&gt;modal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vllm-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# Enforce Bearer Token Auth
&lt;/span&gt;    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;volumes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/root/.cache/huggingface&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;modal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Volume&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;huggingface-cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_if_missing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/root/.cache/vllm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;modal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Volume&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vllm-cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_if_missing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@modal.web_server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;startup_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;serve&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;
    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vllm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;serve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zai-org/GLM-5.2-FP8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--served-model-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glm-5.2-fp8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Enforce listening on 0.0.0.0 for Modal proxy routing!
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--uvicorn-log-level&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--async-scheduling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--tensor-parallel-size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--kv-cache-dtype&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fp8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--max-model-len&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;131072&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# 131k context window for stability and VRAM headroom
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--max-num-seqs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Limits concurrent sequence VRAM pre-allocation
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--gpu-memory-utilization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.92&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--trust-remote-code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--speculative-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mtp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_speculative_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 5}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--tool-call-parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glm47&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--reasoning-parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glm45&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--enable-auto-tool-choice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--safetensors-load-strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prefetch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--enable-prefix-caching&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--enforce-eager&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# Crucial serverless boot parameter
&lt;/span&gt;    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VLLM_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Popen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Technical Post-Mortems &amp;amp; Resolutions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Python Module Shadowing (&lt;code&gt;typing_extensions&lt;/code&gt;)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Symptom:&lt;/strong&gt; During initial container boot, the vLLM engine crash-looped with:
&lt;code&gt;ImportError: cannot import name 'Sentinel' from 'typing_extensions' (/usr/local/lib/python3.12/dist-packages/typing_extensions.py)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Diagnosis:&lt;/strong&gt; &lt;code&gt;pydantic-core&lt;/code&gt; requires &lt;code&gt;typing-extensions&amp;gt;=4.14.1&lt;/code&gt; for the &lt;code&gt;Sentinel&lt;/code&gt; class. Even though we installed &lt;code&gt;typing-extensions&amp;gt;=4.15.0&lt;/code&gt; during the build step, the base CUDA image shipped with a legacy single-file module (&lt;code&gt;typing_extensions.py&lt;/code&gt;) in &lt;code&gt;dist-packages&lt;/code&gt;. Because Python's import system prioritizes single-file modules over package directories in the same search path, it was shadowing our modern package.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Resolution:&lt;/strong&gt; We added a step in our Dockerfile setup to delete the legacy single-file module prior to running pip:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;RUN &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /usr/local/lib/python3.12/dist-packages/typing_extensions.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Safetensors Sequential I/O Bottleneck
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Symptom:&lt;/strong&gt; Startup profiling showed model loading taking over &lt;strong&gt;12 minutes&lt;/strong&gt; due to sequential reads over Modal's virtual network filesystem. vLLM logged the following warning:
&lt;code&gt;Auto-prefetch is disabled because the filesystem (9P) is not a recognized network FS... start vLLM with --safetensors-load-strategy=prefetch.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Resolution:&lt;/strong&gt; We added the &lt;code&gt;--safetensors-load-strategy prefetch&lt;/code&gt; parameter. This forces vLLM to parallelize the disk-to-VRAM loading process using multiple CPU threads, which cut our model loading time from ~12 minutes down to &lt;strong&gt;~1 minute&lt;/strong&gt;, resulting in a total cold start of &lt;strong&gt;~4.5 minutes&lt;/strong&gt; (including hardware allocation and DeepGEMM warmup).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Speculative Decoding vs. Cold Start (MTP &amp;amp; Eager Mode)
&lt;/h3&gt;

&lt;p&gt;GLM-5.2 utilizes &lt;strong&gt;Multi-Token Prediction (MTP)&lt;/strong&gt; to speculate 5 tokens ahead. To make this serverless, we faced a major architectural choice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;CUDA Graphs / &lt;code&gt;torch.compile&lt;/code&gt; (No Eager Mode):&lt;/strong&gt; The server hung for &lt;strong&gt;&amp;gt;20 minutes&lt;/strong&gt; on startup compiling mathematical graphs for our massive context window. &lt;em&gt;Verdict: Infeasible for serverless.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Eager Mode (&lt;code&gt;--enforce-eager&lt;/code&gt;):&lt;/strong&gt; Boot times dropped to our glorious 4.5 minutes, but the &lt;em&gt;first query&lt;/em&gt; with a novel sequence length suffered a &lt;strong&gt;~35-second Time-To-First-Token (TTFT) spike&lt;/strong&gt; while the MTP engine performed JIT Triton kernel compilation on the fly.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Decision:&lt;/strong&gt; We chose Eager Mode. A 35s latency spike on the first interaction is a fair price to pay to avoid a 20-minute startup hang. Once warm, MTP acts with a &lt;strong&gt;100% draft acceptance rate&lt;/strong&gt; on structured code, providing sustainable generation speeds of &lt;strong&gt;30-50 tokens/s&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Practical Validation: Testing on OpenCode &amp;amp; The Flappy Bird Challenge
&lt;/h2&gt;

&lt;p&gt;To validate the deployment, we integrated the Modal server as an OpenAI-compatible provider in &lt;strong&gt;OpenCode&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;First, we tested context handling using a large, real-world file: CPython’s standard library asynchronous coordinator &lt;code&gt;asyncio/tasks.py&lt;/code&gt; (over 1,060 lines of complex concurrent logic). With &lt;code&gt;--reasoning-parser glm45&lt;/code&gt; active, the model’s Chain-of-Thought (CoT) tokens are routed into a dedicated &lt;code&gt;reasoning_content&lt;/code&gt; API property:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;opencode.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"modal-glm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"npm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@ai-sdk/openai-compatible"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"baseURL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;your-modal-workspace&amp;gt;--glm5-2-inference-serve.modal.run/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"glm5_sk_YOUR_SECURE_KEY"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"glm-5.2-fp8"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenCode captures this stream and renders the reasoning process inside a collapsible "Thinking" block in the chat. GLM-5.2 digested the 1,065 lines of code, parsed CPython's execution callbacks, and produced highly accurate, structured architectural analyses.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "One-Shot" Coding Test: Sunset Flier (Flappy Bird)
&lt;/h3&gt;

&lt;p&gt;To stress-test the model's creative capability, logic coherence, and syntax closure in a single-pass context, we executed the following prompt:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fnvcvtpr9n3xn5g15t6p6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fnvcvtpr9n3xn5g15t6p6.png" alt="Promting the Flappy Bird game and Thinking model" width="799" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model successfully generated the game ("Sunset Flier") with outstanding attention to engineering detail:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Physics &amp;amp; Game Loop:&lt;/strong&gt; Clean canvas-based rendering with proper gravity acceleration, jump impulses, score counting, and high-score persistence in &lt;code&gt;localStorage&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;JS Audio Synthesis:&lt;/strong&gt; Instead of loading static &lt;code&gt;.mp3&lt;/code&gt; assets, the model utilized the &lt;strong&gt;HTML5 Web Audio API&lt;/strong&gt; to generate retro sound effects dynamically using oscillator nodes (sine, triangle, square, and sawtooth waveforms) for jumping (flapping), scoring, and crashing (hitting).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This demonstrated the model's capacity to compose interactive, functional software in a single pass.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ftjj11wapt49nlww88i41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ftjj11wapt49nlww88i41.png" alt="The game works amazing" width="600" height="1025"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Future Optimizations
&lt;/h2&gt;

&lt;p&gt;To bring this serverless architecture to the next level of operational excellence, we have mapped out three future optimization vectors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Keep-Warm Scheduling (Active Sessions):&lt;/strong&gt; During active coding hours, a simple serverless &lt;code&gt;cron&lt;/code&gt; job can be configured in Modal to ping the &lt;code&gt;/health&lt;/code&gt; endpoint once every 14 minutes, avoiding the 4.5-minute cold start entirely.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;GPU Memory Snapshots:&lt;/strong&gt; Modal's GPU memory snapshotting technology allows serializing the post-DeepGEMM-warmup VRAM state directly to disk. Restoring a container from a pre-compiled state would bypass both weight-loading and JIT compilation, potentially dropping serverless cold starts to less than 10 seconds.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;SGLang Engine Migration:&lt;/strong&gt; Once SGLang natively supports GLM-5.2's custom MoE layers, migrating the backend from vLLM will help reduce CPU-side host overhead under Eager execution.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  References &amp;amp; Technical Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Model &amp;amp; Weights (Zhipu AI / Z-AI)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Base Model:&lt;/strong&gt; &lt;a href="https://huggingface.co/zai-org/GLM-5.2" rel="noopener noreferrer"&gt;zai-org/GLM-5.2 on HuggingFace&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;FP8 Quantized Checkpoint:&lt;/strong&gt; &lt;a href="https://huggingface.co/zai-org/GLM-5.2-FP8" rel="noopener noreferrer"&gt;zai-org/GLM-5.2-FP8 on HuggingFace&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Official Release Announcement:&lt;/strong&gt; &lt;a href="https://huggingface.co/blog/zai-org/glm-52-blog" rel="noopener noreferrer"&gt;GLM-5.2: Built for Long-Horizon Tasks (HF Blog)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;IndexShare Architecture Paper:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2603.12201" rel="noopener noreferrer"&gt;IndexShare: Sharing Indexer Across Sparse Attention Layers (arXiv:2603.12201)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Core Libraries &amp;amp; Runtime Optimizations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Triton FP8 MoE Kernels:&lt;/strong&gt; &lt;a href="https://github.com/deepseek-ai/DeepGEMM" rel="noopener noreferrer"&gt;DeepSeek's DeepGEMM Library on GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;vLLM Official Serving Recipes:&lt;/strong&gt; &lt;a href="https://recipes.vllm.ai/zai-org/GLM-5.2" rel="noopener noreferrer"&gt;GLM-5.2 vLLM Recipe Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Serverless GPU Deployment:&lt;/strong&gt; &lt;a href="https://modal.com/docs/examples/vllm_inference" rel="noopener noreferrer"&gt;Modal vLLM Web Server Example&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Community Guides &amp;amp; Integrations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Unsloth Local Execution Guide:&lt;/strong&gt; &lt;a href="https://unsloth.ai/docs/models/glm-5.2" rel="noopener noreferrer"&gt;GLM-5.2 - How to Run Locally with Unsloth Studio&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Self-hosting a 700B parameter reasoning model on serverless infrastructure is entirely viable today. The open-source tooling (vLLM, Modal, HuggingFace) is mature enough to give individual developers access to frontier-level intelligence with total data privacy. &lt;/p&gt;

&lt;p&gt;By understanding the trade-offs of JIT compilation, implementing aggressive I/O prefetching, and leveraging prefix caching, we can serve frontier open-weights AI at a fraction of a dollar per session.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>serverless</category>
      <category>llm</category>
      <category>learning</category>
    </item>
    <item>
      <title>From "Why?" to Wow: Building a Multi-Agent Storyteller After 5-Day AI Agents Intensive Course with Google</title>
      <dc:creator>Silvestre</dc:creator>
      <pubDate>Wed, 10 Dec 2025 05:58:06 +0000</pubDate>
      <link>https://dev.to/silvestre-po/from-why-to-wow-building-a-multi-agent-storyteller-after-5-day-ai-agents-intensive-course-with-1lm7</link>
      <guid>https://dev.to/silvestre-po/from-why-to-wow-building-a-multi-agent-storyteller-after-5-day-ai-agents-intensive-course-with-1lm7</guid>
      <description>&lt;h2&gt;
  
  
  My "Aha!" Moment: AI Agents Are More Than Just Chatbots
&lt;/h2&gt;

&lt;p&gt;Before the 5-Day AI Agents Intensive, my view of AI agents was largely centered around conversational interfaces—smart chatbots that could answer questions. The course completely shattered that perception. My key takeaway, and the concept that resonated most, was the idea of an agent as an &lt;strong&gt;orchestrator of specialized tools&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It's not about one giant model doing everything. It's about a reasoning engine that knows how to solve a complex problem by breaking it down and delegating tasks to the best "specialist" for the job. This shift from a monolithic to a modular, tool-centric mindset was my biggest "aha!" moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  How My Understanding Evolved: The Power of the "Worker Agents"
&lt;/h2&gt;

&lt;p&gt;The course's deep dive into &lt;strong&gt;Multi-Agent Systems&lt;/strong&gt; (Day 1) and &lt;strong&gt;Tools/MCP&lt;/strong&gt;(Day 2) was a game-changer. I stopped thinking about building a single, all-powerful agent and started thinking about creating a team of "worker agents" managed by a "coordinator".&lt;/p&gt;

&lt;p&gt;This led to a fundamental change in my approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before: "How can I prompt a model to generate a story, an image, and audio?"&lt;/li&gt;
&lt;li&gt;After: "How can a Coordinator Agent manage three Specialized Agents—a Writer (Gemini), an Illustrator (Flux.1), and a Narrator (OpenAI TTS)—to work in parallel and deliver a result faster and more efficiently?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This evolution in understanding was the direct inspiration for my capstone project.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Capstone Project: 🦁 Curiosity Storybook
&lt;/h2&gt;

&lt;p&gt;For the capstone, I built &lt;strong&gt;Curiosity Storybook&lt;/strong&gt;, an AI agent for the "Agents for Good" track that transforms a child's "Why?" into a magical, multi-sensory learning experience.&lt;/p&gt;

&lt;p&gt;Instead of a dry answer, it generates a complete, personalized storybook page with a story, an illustration, and an audio narration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1udh34e9aesf6dpmurtx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1udh34e9aesf6dpmurtx.jpg" alt="Curiosity Storybook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Silvestre-PO/Curiosity-Storybook" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=22J-rUAjF9Y" rel="noopener noreferrer"&gt;Youtube video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This project is a demonstration of how the most advanced concepts from the course can create a seamless and magical experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;General Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Frontend (UI/UX)&lt;/strong&gt;: A kid-friendly interface built with Gradio, hosted on Hugging Face Spaces.&lt;br&gt;
&lt;strong&gt;2. Agent Orchestrator&lt;/strong&gt;: A main agent managed with Blaxel that uses Gemini 2.5 Pro for reasoning and content generation.&lt;br&gt;
&lt;strong&gt;3. Tools&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A custom MCP (Model Context Protocol) server that exposes tools for specific tasks like narration.&lt;/li&gt;
&lt;li&gt;Direct calls to heavy-compute services for long-running tasks like image generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. AI Models&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Gemini 2.5 Pro: For generating the main story and the illustration prompt.&lt;/li&gt;
&lt;li&gt;Flux.1-schnell: For high-quality image generation.&lt;/li&gt;
&lt;li&gt;OpenAI TTS: For audio narration.&lt;/li&gt;
&lt;li&gt;Hyperbolic (Llama 3.3): For ultra-fast generation of related questions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nbuzqhmehy434wwz4so.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nbuzqhmehy434wwz4so.png" alt="General Architecture" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned by Building It
&lt;/h2&gt;

&lt;p&gt;Building this project was where the concepts from the course truly clicked.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Agent Systems are Practical, Not Just Theoretical&lt;/strong&gt;: My project implements a &lt;strong&gt;Coordinator/Specialist&lt;/strong&gt; pattern. A main agent in Blaxel orchestrates three parallel tasks, each handled by a specialized model. Watching the story, image, and audio generate concurrently was proof of how powerful this architecture is for user experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Engineering is the Secret Sauce&lt;/strong&gt;: Day 3's lesson on Context Engineering was crucial. I implemented a ConversationContext class that uses &lt;strong&gt;compaction&lt;/strong&gt; (summarizing history) to feed a "Question Suggester" agent (Hyperbolic). This allows the agent to suggest relevant follow-up questions without needing the entire conversation transcript, making it fast and efficient. It's the feature that makes the experience feel like a continuous journey of discovery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability Isn't an Afterthought&lt;/strong&gt;: The "Agent Quality" lesson (Day 4) pushed me to integrate basic observability from the start. I implemented &lt;strong&gt;logging&lt;/strong&gt; for all tool calls and &lt;strong&gt;tracing&lt;/strong&gt; (by passing a session_id) to follow a request from start to finish. When the image generation failed once, I could pinpoint the exact step, proving the value of this pillar immediately.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The AI Agents Intensive course was more than a series of lectures; it was a fundamental shift in my mental model of what AI can do. It moved me from thinking about "prompts" to thinking about "systems". My understanding has evolved from seeing agents as simple interfaces to seeing them as complex, problem-solving engines. And "Curiosity Storybook" is the tangible result of that journey.&lt;/p&gt;

</description>
      <category>googleaichallenge</category>
      <category>ai</category>
      <category>agents</category>
      <category>devchallenge</category>
    </item>
  </channel>
</rss>
