xbill for Google Developer Experts

Posted on Jun 21 • Originally published at xbill999.Medium on Jun 21

#antigravity #mcps #speculativedecoding #qat

MTP Speculative Decoding with the 12B Gemma 4 QAT Model on NVIDIA L4, Cloud Run, MCP, and Antigravity CLI

This article provides a step-by-step guide for deploying Gemma 4 to a GPU-enabled Google Cloud Run service. The latest 12B model is used in its QAT formats and deployed with MTP speculative decoding. A suite of Python MCP tools is built to simplify management of the vLLM-hosted Gemma 4 deployment with Antigravity CLI.

What is this project trying to do?

This project is a DevOps/SRE assistant that uses a Gemma 4 model hosted on Cloud Run with GPU. It provides tools to provision the Docker container and deploy the model, as well as for observability and performance testing.

This project is similar to a previous project that targeted GPU-hosted Gemma 4 instances on GCP:

Gemma-SRE: Self-Hosted vLLM Infrastructure Agent

Antigravity CLI

Antigravity CLI is the successor to Gemini CLI, the terminal-driven, agent-assisted coding tool.

Full details on installing Antigravity CLI are here:

Getting Started with Antigravity CLI

Testing the Antigravity CLI Environment

Once you have all the tools in place, you can test the startup of Antigravity CLI.

You will need to authenticate with a Google Cloud Project or your Google Account:

agy

This will start the interface:

Full Installation Instructions

The detailed installation instructions for Antigravity CLI are here:

Getting Started with Antigravity CLI

Python MCP Documentation

The official GitHub Repo provides samples and documentation for getting started:

GitHub - modelcontextprotocol/python-sdk: The official Python SDK for Model Context Protocol servers and clients

Where do I start?

The strategy for starting MCP development for model management is an incremental, step-by-step approach.

First, the basic development environment is set up with the required system variables and a working Antigravity CLI configuration.

Then, a minimal Python MCP Server is built with stdio transport. This server is validated with Antigravity CLI in the local environment.

This setup validates the connection from Antigravity CLI to the local server via MCP. The MCP client (Antigravity CLI) and the Python MCP server both run in the same local environment.

Set Up the Basic Environment

At this point you should have a working Python environment and a working Antigravity CLI installation. The next step is to clone the GitHub samples repository with support scripts:

cd ~
git clone https://github.com/xbill9/gemma4-tips

Then run init.sh from the project directory inside the cloned repository.

The script will attempt to determine your shell environment and set the correct variables:

cd ~/gemma4-tips/gpu-12B-qat-mtp-L4-devops-agent
source init.sh

If your session times out or you need to re-authenticate, you can run the set_env.sh script to reset your environment variables:

cd ~/gemma4-tips/gpu-12B-qat-mtp-L4-devops-agent
source set_env.sh

Variables like PROJECT_ID need to be set up for use in the various build scripts, so the set_env.sh script can be used to reset the environment if your session times out.

Model Management Tool with MCP Stdio Transport

One of the key features of the standard MCP libraries is that they abstract the transport layer.

The high-level MCP tool implementation is the same no matter which low-level transport the MCP client uses to connect to an MCP server.

The simplest transport that the SDK supports is stdio (stdin/stdout), which connects to a locally running process. Both the MCP client and the MCP server must run in the same environment.

The connection over stdio will look similar to this:

from mcp.server.fastmcp import FastMCP

# Initialize FastMCP server
mcp = FastMCP("Self-Hosted vLLM DevOps Agent")
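To make the transport concrete: the MCP stdio transport exchanges JSON-RPC 2.0 messages as single lines of JSON over the server process's stdin and stdout. The sketch below only illustrates that framing with the standard library. The helper names are hypothetical, the tool name status_vllm is borrowed from the tool call shown later in this article, and in the real project the SDK handles all of this.

```python
import io
import json


def encode_message(message: dict) -> str:
    """Serialize one JSON-RPC message as a single newline-terminated line."""
    # json.dumps escapes newlines inside strings, so the output stays on one line.
    return json.dumps(message, separators=(",", ":")) + "\n"


def read_messages(stream):
    """Yield one decoded JSON-RPC message per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical request an MCP client could write to the server's stdin.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "status_vllm", "arguments": {}},
}

# io.StringIO stands in for the pipe between the client and server processes.
pipe = io.StringIO(encode_message(request))
received = list(read_messages(pipe))
print(received[0]["method"])
```

Because the framing is line-based, anything else a server prints to stdout would corrupt the stream, which is why stdio servers usually send their logs to stderr.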

Running the Python Code

First, switch to the directory containing the Python MCP sample code:

cd ~/gemma4-tips/gpu-12B-qat-mtp-L4-devops-agent

Run the release version on the local system:

make install
Processing ./.

The project can also be linted:

xbill@penguin:~/gemma4-tips/gpu-12B-qat-mtp-L4-devops-agent$ make lint
ruff check .
All checks passed!
ruff format --check .
6 files already formatted
mypy .
Success: no issues found in 6 source files

Getting Started with Gemma 4 on GPU

The official vLLM repo also has Gemma 4 specific information:

Releases · vllm-project/vllm

Gemma 12B

Gemma 12B is the latest model in the Gemma 4 series. More details are available here:

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

and here:

Gemma 4 12B: The Developer Guide

NVIDIA L4 on Cloud Run

The L4 on Google Cloud Run provides a cost-effective “scale to zero” option for providing inference:

Google Cloud Run Adds Support for NVIDIA L4 GPUs, NVIDIA NIM, and Serverless AI Inference Deployments at Scale | NVIDIA Technical Blog

Quantization-Aware Training (QAT)

One of the key constraints when deploying models is the GPU memory available on the target hardware. Along with the new 12B model release, new versions of the model that use QAT to reduce memory requirements have been released:

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

QAT Model Versions

What are the differences between the versions on Hugging Face?

> what is the difference between google/gemma-4-12B-it-qat-q4_0-unquantized and https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf and google/gemma-4-12B-it-qat-w4a16-ct

▸ Thought for 1s, 602 tokens
  Analyzing Gemma QAT Variations

● Read(/home/xbill/gemma4-tips/gpu-12B-qat-L4-devops-agent/GEMINI.md) (ctrl+o to expand)

  Here is a comparison of the three Gemma 4 QAT (Quantization-Aware Training) model configurations:

  ### 1. google/gemma-4-12B-it-qat-q4_0-unquantized 

  • Format: Unquantized, high-precision weights (usually float16 or bfloat16) extracted directly from the QAT
  training pipeline.
  • Target Engine: Standard deep learning/fine-tuning framework runtimes (e.g., PyTorch, Transformers, Jax).
  • Primary Use Case:
      • Serving as a high-precision reference baseline.
      • Custom downstream quantization, compilation, or MLX conversion.
      • Research, fine-tuning, or running Speculative Decoding (where this serves as the primary model alongside
      its matching MTP assistant/draft model).


  ### 2. google/gemma-4-12B-it-qat-q4_0-gguf 

  • Format: GGUF (specifically quantized using the standard q4_0 4-bit method).
  • Target Engine: Local CPU, Apple Silicon, or consumer GPU engines (e.g., llama.cpp , Ollama, LM Studio).
  • Primary Use Case:
      • Zero-setup local execution with minimal memory requirements.
      • Running LLMs efficiently on consumer-grade hardware.


  ### 3. google/gemma-4-12B-it-qat-w4a16-ct 

  • Format: Compressed Tensors format ( -w4a16-ct ), which stores 4-bit weights while performing computation using
  16-bit activations.
  • Target Engine: High-performance, high-throughput cloud inference engines (specifically vLLM or SGLang ).
  • Primary Use Case:
      • High-concurrency cloud deployments utilizing GPU accelerators (e.g., NVIDIA L4).
      • Optimizing GPU memory usage and improving inference speed while maintaining near-lossless model accuracy
      via QAT.

  ──────
  ### 📊 Summary Reference Table

   Checkpoint Configuration | Serialization Format              | Target Platform                | Primary Engine
  --------------------------|-----------------------------------|--------------------------------|-----------------------------------
   ...-qat-q4_0-unquantized | Unquantized (BF16 / FP16)         | Custom compilation / research  | Transformers / Speculative Drafts
   ...-qat-q4_0-gguf        | GGUF (4-bit weight)               | Local Desktop/Server (CPU/GPU) | llama.cpp / LM Studio / Ollama
   ...-qat-w4a16-ct         | compressed-tensors (4-bit/16-bit) | Cloud GPU Deployments          | vLLM / SGLang

  For more details on QAT architecture and configuration, see GEMINI.md.
  ──────
  ### Summary of Work

  • Reviewed Context: Analyzed the workspace documentation GEMINI.md to compare the unquantized, GGUF, and
  compressed-tensors configurations.

MCP stdio Transport

One of the key features of MCP is that it abstracts the transport layer.

The high-level MCP tool implementation is the same no matter which low-level transport the MCP client uses to connect to an MCP server.

In this project Antigravity CLI is used as the MCP client to interact with the Python MCP server code.

Antigravity CLI mcp_config.json

A sample MCP server configuration file (mcp_config.json) is provided in the .agents directory:

{
  "mcpServers": {
    "gpu-devops-agent": {
      "command": "python3",
      "args": [
        "/home/xbill/gemma4-tips/gpu-12B-qat-mtp-L4-devops-agent/server.py"
      ],
      "env": {
        "GOOGLE_CLOUD_PROJECT": "aisprint-491218",
        "GOOGLE_CLOUD_LOCATION": "us-east4",
        "VLLM_BASE_URL": "https://gpu-12b-qat-mtp-289270257791.us-east4.run.app",
        "MODEL_NAME": "/mnt/models/gemma-4-12B-it-qat-w4a16-ct"
      }
    }
  }
}
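Before pointing Antigravity CLI at a configuration like this, it can help to confirm that each server entry carries the environment variables the tools will need. The following is a minimal sketch, assuming the four keys in the sample file above are the required set (that list is an assumption drawn from the sample, not a documented requirement of the server). The function name and the placeholder values are hypothetical.

```python
import json

# Assumed required keys, taken from the sample mcp_config.json above.
REQUIRED_ENV = (
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
    "VLLM_BASE_URL",
    "MODEL_NAME",
)


def check_mcp_config(text: str) -> dict:
    """Return {server_name: [missing env keys]} for each configured MCP server."""
    config = json.loads(text)
    problems = {}
    for name, server in config.get("mcpServers", {}).items():
        env = server.get("env", {})
        problems[name] = [key for key in REQUIRED_ENV if not env.get(key)]
    return problems


# Placeholder configuration with MODEL_NAME deliberately left out.
sample = """
{
  "mcpServers": {
    "gpu-devops-agent": {
      "command": "python3",
      "args": ["server.py"],
      "env": {
        "GOOGLE_CLOUD_PROJECT": "my-project",
        "GOOGLE_CLOUD_LOCATION": "us-east4",
        "VLLM_BASE_URL": "https://example.invalid"
      }
    }
  }
}
"""

print(check_mcp_config(sample))
```

An empty list for a server means every expected key is present and non-empty.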

Validation with Antigravity CLI

The final connection test uses Antigravity CLI as an MCP client with the Python code providing the MCP server:

MCP Servers

Plugins (~/.gemini/antigravity-cli/plugins)
> ✓ google-dev-knowledge Tools: search_documents, answer_query, get_documents
   ✓ gpu-devops-agent Tools: save_hf_token, get_vllm_endpoint, list_vertex_models, list_bucket_models,
                       analyze_cloud_logging, +21 more

Checking System status

The status can be checked with an MCP tool:

> vllm_status

● gpu-devops-agent/status_vllm(Checking status of the vLLM deployment via MCP) (ctrl+o to expand)

  Here is the current status of the gpu-12b-qat-mtp Cloud Run deployment:

  • Service Status: 🟢 Ready (all configurations and routing conditions are met)
  • Latest Revision: gpu-12b-qat-mtp-00032-nwb (deployed on 2026-06-21)
  • Service URL: https://gpu-12b-qat-mtp-wgcq55zbfq-uk.a.run.app

Cross-Check the Deployed Model

Once the model has started, the MCP tool allows for model verification:

> verify_model_health

● gpu-devops-agent/verify_model_health(Verifying model health for gpu-12b-qat-mtp via MCP) (ctrl+o to expand)

  The model health check has passed successfully:

  • Model: /mnt/models/gemma-4-12B-it-qat-w4a16-ct 
  • Response: "Yes, I am working! I am ready to help you with any..." 
  • Latency: 0.88 seconds

and model details:

> get_active_models

131.0k in | 677 out | 415 thinking
● Bash(ENDPOINT="https://gpu-12b-qat-mtp-wgcq55zbfq-uk.a.run.app"; TOKEN=$(gcloud auth pri...) (ctrl+o to
expand)

  The active model on the deployment is:

  • Model ID: /mnt/models/gemma-4-12B-it-qat-w4a16-ct 
  • Owned by: vllm 

  (Note: The /health check returned a 503 in get_model_details because the request was
  unauthenticated, and Cloud Run requires an IAM identity token for routing. When queried with the proper
  authorization headers, the endpoint returns a clean 200 OK health status).
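The note about the 503 is worth a concrete illustration. A Cloud Run service that requires authentication expects an identity token in the Authorization header, and vLLM's OpenAI-compatible server lists its models at /v1/models. The sketch below builds such a request without sending it. The function name is hypothetical and the URL and token are placeholders; one way to obtain a real token is `gcloud auth print-identity-token`.

```python
import urllib.request


def build_models_request(base_url: str, identity_token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET for the model list."""
    url = base_url.rstrip("/") + "/v1/models"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {identity_token}"},
        method="GET",
    )


# Placeholder values; substitute the service URL and a real identity token.
req = build_models_request("https://example.invalid/", "TOKEN_PLACEHOLDER")
print(req.full_url)
print(req.get_header("Authorization"))

# To send it: urllib.request.urlopen(req, timeout=30), then parse the JSON body.
```

Keeping request construction separate from sending makes the authentication logic easy to check without touching the live service.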

and status:

Here is the current system status:

  ### 🌀 GPU Cloud Run System Status ( get_system_status )

  • vLLM Health: 🟢 Online ( https://gpu-12b-qat-l4-devops-agent-289270257791.us-east4.run.app )
  • Cloud Run Service Status: 🟢 Ready
  ──────
  ### 🧩 Active Model Details ( get_model_details )

  • Active Model: /mnt/models/gemma-4-12B-it-qat-w4a16-ct 
  • Health Endpoint: Healthy ✅
  ──────

Review the Model

Check the model details:

> what are the model details including vllm optimization

  Here are the model and vLLM optimization details for the active gpu-12b-qat-mtp deployment:

  ### 🧩 Core Model Details

  • Primary Model: google/gemma-4-12B-it-qat-w4a16-ct (Quantization-Aware Training, INT4 weights with 16-bit
  activations)
  • Speculative/Assistant Model (MTP): google/gemma-4-12B-it-assistant (Multi-Token Prediction)
  • Speculative Tokens: 3 ( --spec-tokens=3 , --spec-method=mtp )
  ──────
  ### 🚀 vLLM Performance Optimizations

   Category            | Optimization Parameter           | Value/Setting            | Purpose
  ---------------------|----------------------------------|--------------------------|----------------------------------
   Inference Kernel    | VLLM_DISABLE_FLASHINFER          | 0 (Enabled)              | Uses high-performance FlashInfer kernels for attention operations.
   Sampler             | VLLM_USE_FLASHINFER_SAMPLER      | 1 (Enabled)              | Enables FlashInfer's optimized sampling kernels, decreasing time-to-first-token.
   Memory Allocation   | PYTORCH_CUDA_ALLOC_CONF          | expandable_segments:True | Minimizes VRAM memory fragmentation and helps prevent Out-Of-Memory (OOM) errors.
   KV Cache            | --kv-cache-dtype                 | fp8                      | Quantizes key-value cache to 8-bit precision, doubling the available context capacity.
   KV Cache Allocation | --gpu-memory-utilization         | 0.95                     | Allocates 95% of GPU VRAM to vLLM's storage caches.
   Batching & Context  | --max-num-batched-tokens         | 8192                     | Sets maximum batched prefill tokens to allow higher throughput under concurrency.
   Chunking            | --enable-chunked-prefill         | Enabled (Implicit)       | Chunks massive requests to avoid stalling shorter requests in queue.
   Load Strategy       | --safetensors-load-strategy      | prefetch                 | Prefetches model weights during container boot-up to reduce startup time.
   Scheduling          | --async-scheduling               | Enabled                  | Enables asynchronous scheduling loop in vLLM to pipeline execution.
   Context Limit       | --max-model-len                  | 32768                    | Limits maximum context length to keep memory reservation stable.
   Concurrency Limit   | --max-num-seqs                   | 8                        | Caps maximum concurrent request sequences active in a forward pass.
   Threading           | MKL_NUM_THREADS, OMP_NUM_THREADS | 1                        | Forces single-threading on CPU to avoid CPU core contention on the host.


and check the metrics:

  Here is a summary of the key metrics retrieved from the active gpu-12b-qat-mtp vLLM engine:

  ### 🚀 Performance & Throughput

  • Prefill Tokens Processed: 29,389,524 tokens (prefilled)
  • Generation Tokens Processed: 9,494 tokens (generated)
  • Active Running Requests: 0.0 (Idle)
  • Active Waiting Requests: 0.0 
  • KV-Cache Usage: 0.0% (Idle)

  ### 🔮 Speculative Decoding / MTP Statistics

  • Total Draft Iterations: 139 
  • Total Speculative/Draft Tokens Proposed: 417 tokens
  • Total Speculative Tokens Accepted: 23 tokens
  • Speculative Acceptance by Position:
      • Position 0: 11 tokens accepted
      • Position 1: 6 tokens accepted
      • Position 2: 6 tokens accepted
      • Overall Acceptance Rate: ~5.5% (demonstrating why MTP is heavily bound by scheduling and sequential
      evaluation overhead on a single GPU for typical workloads)

  ### 🕒 Latency & Queuing (Cumulative)

  • Average Time to First Token (TTFT): ~0.28 seconds
  • Average Request Queue Time: ~0.016 milliseconds (minimal queue delay)
  • Average End-to-End Latency: ~0.28 seconds (highly responsive)
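The speculative decoding figures in that summary can be re-derived from the raw counters, which is a useful sanity check on any reported acceptance rate. This short sketch uses only the numbers quoted above. The "mean tokens per target step" line assumes the usual speculative decoding accounting, in which every verification pass emits one token from the target model in addition to any accepted draft tokens.

```python
# Counters copied from the metrics summary above.
draft_iterations = 139
proposed_tokens = 417
accepted_tokens = 23
accepted_by_position = [11, 6, 6]

tokens_per_draft = proposed_tokens / draft_iterations  # 3 draft tokens per iteration
acceptance_rate = accepted_tokens / proposed_tokens
# Assumption: each verification pass also yields one token from the target model.
mean_tokens_per_step = 1 + accepted_tokens / draft_iterations
position_rates = [n / draft_iterations for n in accepted_by_position]

print(f"draft tokens per iteration: {tokens_per_draft:.0f}")
print(f"acceptance rate: {acceptance_rate:.1%}")
print(f"mean tokens per target step: {mean_tokens_per_step:.3f}")
print("per-position acceptance:", [f"{r:.1%}" for r in position_rates])
```

With roughly 1.17 tokens emitted per target pass, the draft work buys very little, which is consistent with the benchmark results that follow.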

Once everything is online, you can send queries to the model:

● gpu-devops-agent/query_gemma4_with_stats(Querying Gemma 4 via MCP with stats returned) (ctrl+o to expand)

  Here are the performance stats and the model response to your query:

  ### 📊 Performance Statistics

  • Active Model: /mnt/models/gemma-4-12B-it-qat-w4a16-ct 
  • Time to First Token (TTFT): 0.215 seconds
  • Total Generation Time: 12.60 seconds
  • Generation Speed: 18.16 tokens/s
  • Total Generated Tokens: ~225 tokens
  ──────
  ### 💬 Model Response

────────────────────────────────────────────────────────────
  Depending on the context, Gemma usually refers to one of two things: Google's family of open AI models, or
  a gemstone.

  #### 1. In Artificial Intelligence (Google’s AI Models)

  Gemma is a family of lightweight, state-of-the-art open models developed by Google DeepMind.

  It is built using the same research and technology as Google's larger Gemini models, but is designed to be
  small and efficient enough for developers to run on their own hardware (such as laptops or single-GPU
  servers).

  • Open Weights: The model weights are publicly available, allowing developers to fine-tune, download, and
  host them privately.
  • Sizes: Available in various sizes (e.g., 2B, 7B, 9B, 27B parameters, and the latest 12B/31B iterations
  in Gemma 4) to balance capability and hardware constraints.
  • Versatility: Capable of text generation, summarization, coding, and logical reasoning.

  #### 2. In Gemology & Names

  • Gemstone: "Gemma" is the Latin word for a precious gemstone or jewel.
  • Name: A popular feminine given name in several European languages (especially Italian), translating to
  "precious stone" or "gem."

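The generation speed in the statistics above depends on how the time window is defined. The sketch below computes tokens per second two ways from the reported figures: over the whole request, and over the decode phase only (excluding the time to first token). The token count in the transcript is approximate, so neither variant is claimed to be the exact formula the tool uses. The function name is hypothetical.

```python
def tokens_per_second(generated_tokens: int, total_seconds: float, ttft_seconds: float = 0.0) -> float:
    """Generated tokens divided by elapsed time, optionally excluding the prefill wait."""
    decode_seconds = total_seconds - ttft_seconds
    if decode_seconds <= 0:
        raise ValueError("total_seconds must exceed ttft_seconds")
    return generated_tokens / decode_seconds


# Figures from the statistics block above (~225 tokens, 12.60 s total, 0.215 s TTFT).
end_to_end = tokens_per_second(225, 12.60)
decode_only = tokens_per_second(225, 12.60, ttft_seconds=0.215)
print(f"end-to-end: {end_to_end:.2f} tok/s, decode-only: {decode_only:.2f} tok/s")
```

When comparing runs, the important thing is to use the same definition each time, since the two differ by the prefill time.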
Benchmark the Local Model

The local model was benchmarked:

The benchmark sweep has completed successfully. Here are the compiled results across all combinations of context
  window sizes and concurrency levels:

The analysis is reviewed:

> summarize the benchmark results
  Here is a summary of the benchmark results comparing the Standard QAT (No MTP) configuration against
  various Multi-Token Prediction (MTP) settings on a single NVIDIA L4 GPU:
  ──────
  ### 1. Standard Workloads (Short Contexts, 128 Tokens)

  For typical developer and agent workloads, MTP/Speculative Decoding degrades performance.

  • Why? The sequential overhead of running the draft model ( google/gemma-4-12B-it-assistant ) on a single
  GPU offsets the memory bandwidth savings when the primary model is already highly optimized.
  • Optimized MTP-3 Impact: By enabling FlashInfer and increasing max-num-batched-tokens to 8192 , we
  recovered significant throughput under high concurrency. At 512 concurrent users, throughput for MTP-3
  improved from 8.4 req/s to 12.84 req/s (a +52.8% increase).

  Throughput Comparison (Requests/sec at 128 context):

   Concurrency (Users) | Standard QAT (No MTP) | MTP-4 | MTP-3 | MTP-2 | Optimized MTP-3 (Tuned)
  ---------------------|-----------------------|-------|-------|-------|------------------------
                     8 |                  18.9 |  11.3 |   8.8 |   8.0 |                    8.53
                    64 |                  23.8 |  12.9 |  11.9 |  11.9 |                   12.12
                   512 |                  23.7 |  13.0 |   8.4 |   8.0 |                   12.84
  ──────
  ### 2. Large Context Workloads (16,384 Tokens)

  At very large context lengths, inference becomes heavily memory-bandwidth bound due to the massive size of
  the KV cache. Here, MTP improves performance.

  • Why? Speculative decoding allows the GPU to evaluate multiple tokens per target model forward pass,
  reducing the frequency of loading the giant KV cache from High Bandwidth Memory (HBM) to GPU SRAM.
  • Best Configuration: MTP-3 (3 tokens) and MTP-4 yield the highest overall throughputs. Under high
  concurrency (512 users), MTP-3 outperforms Standard QAT by +11.4%. Under extreme load (2048 users),
  Optimized MTP-3 maintains a stable 6.66 req/s throughput.

  Throughput Comparison (Requests/sec at 16K context):

   Concurrency (Users) | Standard QAT (No MTP) | MTP-4 | MTP-3 | MTP-2 | Optimized MTP-3 (Tuned)
  ---------------------|-----------------------|-------|-------|-------|------------------------
                     8 |                  5.16 |  6.55 |  0.15 |  0.17 |                    0.14
                    64 |                  7.39 |  8.11 |  7.65 |  8.00 |                    7.61
                   512 |                  7.54 |  8.24 |  8.40 |  7.70 |                    8.13
                  2048 |                  6.30 |  8.00 |  7.55 |  8.01 |                    6.66
  ──────
  ### 💡 Core Recommendations

  1. Production DevOps Agent (Default): Keep MTP Disabled. The standard QAT model
  ( google/gemma-4-12B-it-qat-w4a16-ct ) alone provides the highest throughput (up to 23.8 req/s) and lowest
  latencies for standard agent interactions.
  2. Long-Context Tasks (16K+): Enable MTP-3 or MTP-4 with FlashInfer and batched token tuning enabled to
  accelerate processing times and improve throughput by up to 27%.
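The percentage claims in the summary can be checked directly against the 16K context table. This sketch recomputes the relative change of each MTP variant against the Standard QAT baseline, using the table values exactly as reported (including the very low 8-user figures, which are reproduced as-is).

```python
# Requests/sec at 16K context, copied from the table above.
# Tuple order: Standard QAT (No MTP), MTP-4, MTP-3, MTP-2, Optimized MTP-3.
results_16k = {
    8: (5.16, 6.55, 0.15, 0.17, 0.14),
    64: (7.39, 8.11, 7.65, 8.00, 7.61),
    512: (7.54, 8.24, 8.40, 7.70, 8.13),
    2048: (6.30, 8.00, 7.55, 8.01, 6.66),
}
labels = ("MTP-4", "MTP-3", "MTP-2", "Optimized MTP-3")


def percent_change(value: float, baseline: float) -> float:
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


for users, (baseline, *variants) in results_16k.items():
    deltas = ", ".join(
        f"{label}: {percent_change(v, baseline):+.1f}%"
        for label, v in zip(labels, variants)
    )
    print(f"{users:>5} users | {deltas}")
```

The +11.4% figure corresponds to MTP-3 at 512 users, and the "up to 27%" figure corresponds to the 2048-user row.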

Wait, what was that? I thought speculative decoding improved performance!

> why didn't MTP improve performance

  Multi-Token Prediction (MTP) / Speculative Decoding failed to improve performance for standard workloads
  (short context sizes) due to three key technical bottlenecks on the single NVIDIA L4 GPU configuration:

  ### 1. Compute-Bound vs. Memory-Bound Regimes

  • When MTP Helps (Memory-Bound): LLM decoding is typically memory-bandwidth bound because the GPU has to
  fetch the entire model's weights from High Bandwidth Memory (HBM) to SRAM for every single token generated.
  Speculative decoding speeds this up by using a tiny draft model to propose tokens, allowing the main model
  to verify them in one joint forward pass.
  • When MTP Fails (Compute-Bound): When you have multiple concurrent users, vLLM groups these requests into
  batches. As the batch size increases, the workload shifts from memory-bound to compute-bound (fully
  utilizing the Tensor Cores). In this regime, weight loading is no longer the bottleneck; instead, raw
  compute capability is. Running the draft model sequentially to generate speculative tokens adds extra
  compute steps to the queue, increasing latency and reducing throughput.

  ### 2. Draft Model Execution & Sequential Verification Overhead

  • Speculative decoding requires the GPU to run the assistant model sequentially (e.g., 3 draft steps)
  before running the target model once to verify them.
  • The Gemma 4 12B QAT model is already extremely fast natively on the L4 GPU. Because of this, the
  overhead of running the draft model sequentially is disproportionately high compared to the target model.
  If even a few proposed tokens are rejected, the wasted draft passes result in a net-negative speedup.

  ### 3. VRAM Contention and Scheduler Limits

  • Assistant Model Overhead: Loading the speculative draft model ( google/gemma-4-12B-it-assistant )
  consumes about 0.8 GB of VRAM. While small, this reduces the space available for vLLM's KV cache.
  • Token Budget Contention: Speculative decoding requires reservation of extra token slots in the scheduler.
  As vLLM warns, this limits the maximum concurrent sequences the engine can schedule per step, causing
  requests to queue up and increasing time-in-queue under high concurrency.
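The trade-off described above can be approximated with the standard back-of-the-envelope model for speculative decoding. It is a simplification, not a description of vLLM's scheduler: it assumes each draft token is accepted independently with probability alpha, and that verifying k draft tokens costs the same as one ordinary target pass (which only holds while decoding is memory-bound). The draft_cost value of 0.1 is a hypothetical ratio chosen for illustration, and the measured 5.5% figure is used as a rough stand-in for alpha.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k draft tokens,
    assuming each draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding; draft_cost is the time of one draft pass
    expressed as a fraction of one target pass (hypothetical value)."""
    return expected_tokens_per_step(alpha, k) / (1.0 + k * draft_cost)


for alpha in (0.055, 0.5, 0.8):
    speedup = estimated_speedup(alpha, 3, draft_cost=0.1)
    print(f"alpha={alpha:.3f} k=3 speedup={speedup:.2f}x")
```

Under these assumptions a low acceptance rate gives a value below 1.0, meaning a net slowdown. Once batching makes the workload compute-bound, the verification pass is no longer "free" either, so the real result is worse than this model predicts.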

So what are the bottlenecks?

> what are the key constraints for the deployment on this hardware profile

  Deploying serving stacks on this specific hardware profile—a single NVIDIA L4 GPU (24GB VRAM) combined
  with Cloud Run Gen2 (4 vCPUs, 16GB System RAM)—introduces several critical constraints:

  ### 1. GPU VRAM Limitations (24 GB)

  • Model Footprint vs. KV Cache: The L4's 24GB VRAM is the primary ceiling. The baseline Gemma 4 12B model
  in half-precision (bfloat16) consumes ~24GB of memory, leaving 0 GB for the KV cache and causing immediate
  Out-of-Memory (OOM) crashes under concurrent load.
  • The QAT Requirement: To serve the model, we must use the Quantization-Aware Training (QAT) 4-bit
  compressed format ( w4a16-ct ). This compresses the model weights to ~6 GB, freeing up ~18 GB for the KV
  cache.
  • Speculative Model Tax: Enabling MTP adds the assistant model weights to VRAM (~0.8 GB), further
  squeezing the KV cache space.

  ### 2. GPU Memory Bandwidth (300 GB/s)

  • Compared to H100 (3.35 TB/s) or A100 (2.0 TB/s), the L4 has relatively low memory bandwidth (300 GB/s).
  This makes single-user, memory-bound autoregressive decoding natively slower, which is why optimizing the
  kernels (FlashInfer) and memory alignment is critical.

  ### 3. Cloud Run CPU Throttling ( --no-cpu-throttling )

  • Cloud Run Gen2 defaults to throttling CPU down to near-zero when no requests are actively being
  processed. For GPU serving, this must be disabled ( --no-cpu-throttling ). If CPU throttling is active,
  the container's CPU cannot coordinate with the GPU quickly enough, resulting in severe timeouts, failed
  startup health probes, and model-loading hangs.

  ### 4. GCS FUSE UID/GID Permissions

  • Model weights are mounted from Google Cloud Storage (GCS) using Cloud Run's FUSE volume mount.
  • The official vllm/vllm-openai container runs as the non-root user vllm (UID/GID 1001 ). If you do
  not explicitly mount the GCS volume with the options mount-options=uid=1001;gid=1001 , the container will
  fail to boot with a Permission Denied error because the non-root user cannot read the mounted model
  files.

  ### 5. System RAM (16 GB) and vCPUs (4)

  • The host system only has 16 GB of System RAM and 4 vCPUs. Loading model weights from the GCS mount into
  the GPU requires buffering through system memory. During boot, vLLM's initialization and model loading
  must be carefully managed (e.g. using safetensors-load-strategy=prefetch ) to prevent exceeding the
  host's 16GB memory ceiling, which triggers immediate container evictions (OOMKilled).
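The VRAM figures in the first constraint follow from simple arithmetic on parameter count and bits per weight. The sketch below reproduces them, assuming a nominal 12 billion parameters and decimal gigabytes. It ignores quantization scales, any layers kept at higher precision, activations, and runtime overhead, so the results are rough upper bounds for the KV cache, not measurements. It also applies the 0.95 memory utilization setting, which is why the result is a little lower than the ~18 GB quoted above.

```python
GB = 1e9  # decimal gigabytes, matching the rounded figures in the text


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    return params * bits_per_weight / 8 / GB


params = 12e9        # nominal parameter count for a "12B" model (assumption)
vram_gb = 24.0       # NVIDIA L4
utilization = 0.95   # --gpu-memory-utilization from the optimization table
draft_gb = 0.8       # assistant model size quoted in the transcript

bf16 = weight_gb(params, 16)
int4 = weight_gb(params, 4)
budget = vram_gb * utilization

print(f"bf16 weights: {bf16:.1f} GB, 4-bit weights: {int4:.1f} GB")
print(f"left for KV cache (4-bit, no MTP): {budget - int4:.1f} GB")
print(f"left for KV cache (4-bit, with MTP): {budget - int4 - draft_gb:.1f} GB")
```

The same function makes it easy to see why the bfloat16 checkpoint cannot be served on this GPU: the weights alone already fill the card.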

Cost Analysis

OK so there are some limitations with this hardware profile. What about the cost benefits?

> how cost effective is this deployment?

  This deployment profile (Cloud Run Gen2 GPU + NVIDIA L4) is exceptionally cost-effective for SRE/DevOps
  automation workloads, primarily due to its serverless architecture, hardware efficiency, and the optimized
  model format.
  Here is a breakdown of the cost dynamics:
  ### 1. Serverless Scale-to-Zero Advantage

  Unlike traditional setups on VMs (Compute Engine) or Kubernetes (GKE) where you pay for the GPU 24/7 even
  when idle, Cloud Run allows you to configure min-instances=0 .

  • Idle Cost: $0 when no SRE alerts or queries are active.
  • Cold Start Consideration: While scaling to zero saves maximum budget, it incurs a cold start penalty
  (~1–2 minutes to mount GCS weights and boot vLLM). For critical SRE workloads, keeping min-instances=1 
  is recommended, which costs a fixed baseline but is still cheaper than full VM hosting due to Cloud Run's
  granular allocation.

  ### 2. High Performance-per-Dollar of the L4 GPU

  The NVIDIA L4 (based on the Ada Lovelace architecture) is designed specifically for low-cost,
  energy-efficient inference:

  • Cloud Run GPU Pricing: Estimated at roughly $0.40 to $0.50 per hour for the GPU resource alone, plus
  minimal CPU/Memory allocation.
  • Total Hourly Cost (Active): Approximately $0.60 to $0.70/hour when running.
  • Comparison to A100/H100: An A100 instance generally costs $2.00 to $3.50/hour. Since the optimized Gemma
  4 12B QAT model achieves a highly respectable ~24 req/s on a single L4, you get a massive percentage of
  A100-level throughput at a fraction of the hardware cost.

  ### 3. Efficiency Gains from Quantization (QAT)

  Using the QAT INT4 model ( w4a16-ct ) directly improves cost-effectiveness:

  • Memory Reduction: Compressing the weights from 24GB to 6GB allows the model to fit comfortably on a
  single 24GB L4 GPU.
  • Avoiding Multi-GPU Costs: Without quantization, serving a 12B model in bfloat16 would require either a
  much more expensive GPU (like an A100 40GB/80GB) or multiple L4 GPUs in a tensor-parallel setup, which
  would double or triple the hourly cost.

  ### 4. Summary Financial Comparison

   Metric | Dedicated VM (e.g., g2-… | Cloud Run (Scale-to-Zero… | Cloud Run (Always-On, mi…
  ----------------------|--------------------------|---------------------------|----------------------------
   Hourly Rate (Active) | ~$0.70 | ~$0.70 | ~$0.70
   Monthly Cost | ~$504.00 (fixed) | ~$42.00 (usage-based) | ~$504.00 (max baseline)
   Best Fit | Continuous 24/7 high- | Intermittent SRE alert | Real-time SRE response,
                        | throughput workloads | analysis, dev/testing | zero cold-start tolerance

───────────────────────────────────────────────────────────────────────────────────────────────────────────────
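The monthly figures in the comparison table are straightforward to reproduce, which also shows what usage level the scale-to-zero estimate implies. The hourly rate here is the transcript's estimate, not quoted Cloud Run pricing, and the 30-day month is an assumption chosen because it reproduces the ~$504 figure.

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month (assumption)


def monthly_cost(hourly_rate: float, active_hours: float) -> float:
    """Cost of running an instance for the given number of billed hours."""
    return hourly_rate * active_hours


always_on = monthly_cost(0.70, HOURS_PER_MONTH)
# Active hours implied by the ~$42 scale-to-zero figure in the table.
implied_hours = 42.00 / 0.70

print(f"always-on: ${always_on:.2f}/month")
print(f"scale-to-zero at $42/month implies about {implied_hours:.0f} active hours")
```

In other words, the scale-to-zero column assumes the service is busy for roughly two hours a day; heavier usage moves the bill toward the always-on figure.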

One last cross-check:

> one final review- MTP or not with this deployment? and what about QAT?

  • The Only Exception: If your agent is routinely processing massive datasets or long-context documents
  (16K+ tokens in a single request), MTP will provide a 10% to 27% speedup. Otherwise, the overhead is not
  worth the trade-off.
  ──────
  ### 2. Quantization-Aware Training (QAT): YES (Mandatory)

  Using the QAT compressed weights ( w4a16-ct ) is absolutely critical and mandatory for this hardware
  profile.

  • Why? A single NVIDIA L4 GPU has only 24GB of VRAM.
      • If you deploy the standard bfloat16 model, the weights consume almost all 24GB, leaving 0 GB for the
      KV cache. The service will crash with Out-of-Memory (OOM) errors as soon as multiple requests arrive.
      • The QAT INT4 model compresses the weights to ~6 GB, leaving ~18 GB of VRAM entirely for the KV cache.
      This allows the system to easily handle up to 512 concurrent users with a 100% success rate.
  • Zero Quality Loss: Because this is QAT (quantization simulation during training) rather than standard
  post-training quantization, you get the memory savings of a 4-bit model with the logical accuracy and
  reasoning capabilities of the full 16-bit baseline model.

  ### Final Production Stack Recommendation:

  │ Serve google/gemma-4-12B-it-qat-w4a16-ct using vLLM on a single L4 GPU with MTP disabled, and
  │ FlashInfer enabled for optimal throughput (~24 req/s) and maximum concurrent capacity.

> so MTP is bad?

  MTP is not bad in general—it is just wrong for this specific deployment use case.

  Think of MTP as a highly specialized optimization tool. It works exceptionally well in certain situations,
  but acts as a bottleneck in others.

  ### When MTP is Good:

  • Single-User / Local Environments: If you are running a model locally on your laptop (e.g., Apple Silicon
  or a desktop GPU) for a single user, the system is strictly memory-bound. Here, MTP significantly
  accelerates token generation (often by 1.5x to 2x) because it reduces the time spent loading model weights
  from memory.
  • Extremely Long Contexts (16K+ tokens): When processing huge logs or long documents, loading the massive
  KV cache dominates the GPU's time. MTP reduces the frequency of these expensive memory transfers,
  resulting in a 10% to 27% throughput improvement even under high concurrency.

  ### When MTP is Bad (This DevOps Agent Profile):

  • High Concurrency (Batched Requests): When multiple users query the DevOps agent at the same time, vLLM
  batches the requests. This shifts the GPU from a memory-bound state to a compute-bound state (where Tensor
  Cores are running at 100%). In this state, the sequential execution of the draft model adds extra compute
  overhead, causing requests to queue up and slowing everything down.
  • Hardware Resource Constraints: On a single 24GB GPU, loading the draft model takes up precious VRAM and
  restricts the scheduling capacity of the main engine, causing throughput to degrade under standard log
  analysis sizes.

  ### Summary

  MTP is a latency optimizer for single-user, memory-bound tasks. It is not a throughput optimizer for
  concurrent, compute-bound backend APIs. Since DevOps/SRE agents are backend APIs that handle concurrent
  workloads, turning MTP off yields better overall performance.

Summary

The strategy for using MCP for Gemma 4 GPU deployment with Antigravity CLI was validated with an incremental, step-by-step approach.

A minimal stdio transport MCP server was started from Python source code and validated with Antigravity CLI running as an MCP client in the same local environment. This Python server provided all of the management tools to deploy and troubleshoot the speculative decoding options with the Gemma 4 model.
